COMPUTER-VISION 2026-04-15

RobotPan: A 360° Surround-View Robotic Vision System for Embodied Perception

Jiahao Ma, Qiang Zhang, Peiran Liu, Zeran Su, Pihai Sun, Gang Han, Wen Zhao, Wei Cui, Zhang Zhang, Zhiyuan Xu, Renjing Xu, Jian Tang, Miaomiao Liu, Yijie Guo

RobotPan tackles one of the most frustrating problems in robot teleoperation: giving operators a clear, immersive view of everything happening around the robot in real time. Imagine remotely controlling a humanoid robot through a narrow camera feed: you'd miss critical context about obstacles, nearby people, and task details just outside the frame. RobotPan uses six cameras arranged in a ring around the robot's head (spaced 60° apart) plus a central LiDAR to create a seamless 360° surround view. The system predicts compact 3D Gaussians, a lightweight and fast-to-render 3D representation, directly from these sparse camera inputs, enabling real-time streaming to operators' displays without motion sickness, manual camera switching, or jitter. It is deployed on the Tiangong 3.0 humanoid robot and works during complex dynamic tasks like jumping and full-body manipulation.

ARCHITECTURE

THE PROBLEM

Before RobotPan, robot operators faced three compounding problems: (1) Narrow forward-facing cameras leave critical blind spots, so you can't see what's happening to the sides or behind during manipulation or locomotion. (2) Multiple on-board cameras require manual switching (flipping between front/side/back views), which breaks operator flow and wastes cognitive bandwidth, especially in teleoperation where seconds matter. (3) Stitching multiple camera feeds together or using wide-angle lenses introduces geometric distortion and motion-induced jitter that causes simulator sickness when viewed through head-mounted displays, a serious problem when operators need to stay sharp for emergency takeover. Prior methods either focused on single-view rendering (missing the surround context), required expensive training for every scene (too slow for real-time deployment), or produced too many 3D primitives to stream and render at 30+ fps on bandwidth-constrained robotic systems.

HOW IT WORKS

1

Hardware: Six-Camera + LiDAR Ring Configuration

The physical foundation is simple: six RGB cameras mounted at 60° intervals around the robot's head form a ring with no blind spots, paired with a central LiDAR for depth grounding. This is a constraint-aware design: it's compact enough to fit on a humanoid head, the 60° spacing leaves enough overlap between neighboring views for geometric consistency, and the LiDAR provides metric-scale anchoring (critical for grasping and navigation, where real-world distances matter). Because adjacent cameras overlap only sparsely, RobotPan relies on feed-forward prediction (no test-time optimization) rather than slow iterative methods, which keeps latency under control for real-time teleoperation.
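To make the ring geometry concrete, here is a minimal sketch of how 60° spacing translates into coverage and overlap. The 90° horizontal field of view is a hypothetical value chosen for illustration (the summary does not state the lens FOV), and the function names are invented for this example.

```python
def ring_yaws_deg(num_cameras=6):
    """Yaw of each camera's optical axis, evenly spaced around the head."""
    step = 360.0 / num_cameras
    return [i * step for i in range(num_cameras)]


def adjacent_overlap_deg(num_cameras, horizontal_fov_deg):
    """Angular overlap between neighbouring cameras (negative means a gap)."""
    return horizontal_fov_deg - 360.0 / num_cameras


def cameras_seeing(azimuth_deg, num_cameras, horizontal_fov_deg):
    """Indices of cameras whose horizontal FOV contains the given azimuth."""
    half = horizontal_fov_deg / 2.0
    hits = []
    for i, yaw in enumerate(ring_yaws_deg(num_cameras)):
        # Wrap the angular difference into [-180, 180) before comparing.
        diff = (azimuth_deg - yaw + 180.0) % 360.0 - 180.0
        if abs(diff) <= half:
            hits.append(i)
    return hits


if __name__ == "__main__":
    ASSUMED_FOV_DEG = 90.0  # hypothetical lens FOV, not taken from the paper
    print(ring_yaws_deg(6))
    print(adjacent_overlap_deg(6, ASSUMED_FOV_DEG))
    print(cameras_seeing(30.0, 6, ASSUMED_FOV_DEG))
```

Under that assumed FOV, a direction halfway between two optical axes is seen by both neighbours, while a direction straight along one axis is seen by that camera alone. That narrow shared band is what "sparse overlap" means in practice.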

2

Spherical Coordinate Lifting: Unifying Multi-View Features

The key technical innovation is how RobotPan handles the 360° geometry. Instead of treating each camera's view separately, features from all six cameras are lifted into a unified spherical coordinate system centered on the robot. Think of it like projecting everything onto the inside of a sphere around the robot's head. This unified representation makes it natural to reason about what's nearby (fine detail needed) versus far away (coarse detail sufficient). It also promotes geometric consistency across the six camera views: geometry predicted from different cameras cannot disagree about which frame it lives in when everything shares the same coordinate system.
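The idea of a shared spherical frame can be illustrated with a few lines of coordinate math. This is a sketch under simplifying assumptions, not the paper's lifting module: cameras are treated as sitting at the head origin and differing only by yaw, and the axis convention (x forward, y left, z up) is chosen here for illustration.

```python
import math


def to_spherical(x, y, z):
    """Cartesian head-frame point -> (radius, azimuth, elevation) in radians."""
    r = math.sqrt(x * x + y * y + z * z)
    azimuth = math.atan2(y, x)
    elevation = math.asin(z / r) if r > 0 else 0.0
    return r, azimuth, elevation


def from_spherical(r, azimuth, elevation):
    """Inverse of to_spherical."""
    x = r * math.cos(elevation) * math.cos(azimuth)
    y = r * math.cos(elevation) * math.sin(azimuth)
    z = r * math.sin(elevation)
    return x, y, z


def camera_to_head(p_cam, yaw_deg):
    """Rotate a camera-frame point about the vertical axis into the head frame.

    Assumes a pure yaw offset and no translation between camera and head.
    """
    c = math.cos(math.radians(yaw_deg))
    s = math.sin(math.radians(yaw_deg))
    x, y, z = p_cam
    return (c * x - s * y, s * x + c * y, z)


if __name__ == "__main__":
    # The same physical point, expressed in the frames of camera 0 (yaw 0°)
    # and camera 1 (yaw 60°): 30° to the left of one, 30° to the right of the other.
    p_in_cam0 = from_spherical(2.0, math.radians(30.0), 0.1)
    p_in_cam1 = from_spherical(2.0, math.radians(-30.0), 0.1)
    print(to_spherical(*camera_to_head(p_in_cam0, 0.0)))
    print(to_spherical(*camera_to_head(p_in_cam1, 60.0)))
```

Both observations land on the same (radius, azimuth, elevation) triple, which is the property that lets features from different cameras be merged in one grid.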

3

Hierarchical Spherical Voxel Priors: Smart Resolution Allocation

Once features are in spherical coordinates, RobotPan decodes them into 3D Gaussians using hierarchical spherical voxel priors. The crucial word is 'hierarchical': near the robot (within arm's reach and the operator's focus area), voxels are fine-grained to capture manipulation details with high fidelity. As radius increases, voxels coarsen progressively, so distant background gets less detail. Compared to uniform grids, this hierarchy substantially reduces the number of 3D Gaussians needed (the paper emphasizes 'substantially fewer Gaussians'), making streaming and rendering fast enough for real-time operation. It's a principle you'll see across robotics: allocate computational resources where they matter for the task.
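A rough way to see why radius-dependent resolution saves so much is to compare shell counts. The sketch below uses geometrically growing radial shells as a stand-in for the hierarchy; the growth ratio, inner radius, and shell count are illustrative assumptions, and the paper's actual voxel layout may differ.

```python
import bisect
import math


def radial_edges(r_min, ratio, num_shells):
    """Shell boundaries that grow geometrically: thin near the robot, thick far away."""
    return [r_min * ratio ** i for i in range(num_shells + 1)]


def shell_index(r, edges):
    """Index of the shell containing radius r, or None if r is outside the grid."""
    if r < edges[0] or r >= edges[-1]:
        return None
    return bisect.bisect_right(edges, r) - 1


def uniform_shell_count(edges):
    """Shells a uniform grid would need to match the finest (innermost) thickness."""
    finest = edges[1] - edges[0]
    return math.ceil((edges[-1] - edges[0]) / finest)


if __name__ == "__main__":
    # Hypothetical layout: 0.25 m inner radius, doubling six times out to 16 m.
    edges = radial_edges(0.25, 2.0, 6)
    print(edges)
    print(shell_index(0.3, edges), shell_index(3.0, edges), shell_index(10.0, edges))
    print(len(edges) - 1, "hierarchical shells vs", uniform_shell_count(edges), "uniform shells")
```

With these example numbers the hierarchical layout needs about a tenth of the radial shells a uniform grid would, before any angular coarsening is applied.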

4

Online Fusion with Selective Appearance Updates

Real robot sequences are long, and memory explodes if you naively accumulate Gaussians from every frame. RobotPan's online fusion strategy updates dynamic content (moving parts of the scene) across frames but prevents unbounded growth in static regions by refreshing only the appearance (color/texture) of static areas rather than adding new geometry. This keeps the Gaussian count stable even over dozens of seconds, which is essential for sustained teleoperation. The selective update mechanism learns what to reuse and what to refresh; similar ideas are used in real-time SLAM systems.
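The bookkeeping behind "refresh appearance, don't add geometry" can be sketched with a dictionary keyed by voxel. This is a hand-written rule for illustration only: in RobotPan the update decision is learned, whereas here the set of dynamic voxels (`dynamic_keys`) is simply passed in, and the `Gaussian` class, the blend factor, and the voxel keys are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Gaussian:
    position: tuple  # (x, y, z) centre
    color: tuple     # (r, g, b) in [0, 1]


def fuse_frame(scene, frame, dynamic_keys, blend=0.5):
    """Merge one frame of per-voxel Gaussians into the persistent scene.

    - New voxels and dynamic voxels take the incoming Gaussian as-is.
    - Static voxels keep their geometry; only the color is blended.
    The scene never holds more than one Gaussian per voxel key.
    """
    for key, incoming in frame.items():
        existing = scene.get(key)
        if existing is None or key in dynamic_keys:
            scene[key] = Gaussian(incoming.position, incoming.color)
        else:
            existing.color = tuple(
                (1.0 - blend) * old + blend * new
                for old, new in zip(existing.color, incoming.color)
            )
    return scene


if __name__ == "__main__":
    scene = {}
    frame1 = {"wall": Gaussian((0, 0, 1), (0.0, 0.0, 0.0)),
              "hand": Gaussian((1, 0, 1), (1.0, 1.0, 1.0))}
    frame2 = {"wall": Gaussian((9, 9, 9), (1.0, 1.0, 1.0)),
              "hand": Gaussian((2, 0, 1), (0.0, 0.0, 0.0))}
    fuse_frame(scene, frame1, dynamic_keys=set())
    fuse_frame(scene, frame2, dynamic_keys={"hand"})
    print(len(scene), scene["wall"], scene["hand"])
```

After two frames the scene still holds two Gaussians: the static "wall" kept its position and only changed color, while the dynamic "hand" was replaced outright.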

5

LiDAR Grounding for Metric Reconstruction

The LiDAR isn't just decorative: it provides metric-scale supervision. Computer vision alone struggles to recover absolute scale (is that object 1 meter or 10 meters away?), but LiDAR depth is metrically accurate. By anchoring the Gaussian predictions to LiDAR depth, RobotPan produces metric-scaled 3D reconstructions that operators and downstream manipulation modules can trust for real geometric reasoning. This is non-trivial: fusing six image views with sparse LiDAR points into a consistent metric scene is a primary engineering challenge the paper addresses.
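The scale ambiguity, and how a handful of LiDAR returns resolves it, can be shown with a simple post-hoc alignment. Note that this is a swapped-in technique for illustration (a median depth ratio), not RobotPan's method, which uses LiDAR as grounding for the Gaussian predictions themselves; the function name and sample values are invented.

```python
import statistics


def metric_scale(pred_depths, lidar_depths):
    """Scale factor that maps up-to-scale predicted depths onto LiDAR metres.

    Uses the median of per-point ratios so a few bad LiDAR matches
    (for example, returns from a different surface) do not skew the result.
    """
    ratios = [
        lidar / pred
        for pred, lidar in zip(pred_depths, lidar_depths)
        if pred > 0 and lidar > 0
    ]
    if not ratios:
        raise ValueError("no valid depth pairs to align")
    return statistics.median(ratios)


if __name__ == "__main__":
    predicted = [0.5, 1.0, 2.0, 4.0]  # unitless depths from images alone
    lidar = [1.0, 2.0, 4.0, 80.0]     # metres; the last pair is an outlier
    scale = metric_scale(predicted, lidar)
    print(scale, [d * scale for d in predicted])
```

The image-only depths are internally consistent but off by an unknown factor; the sparse LiDAR samples pin that factor down even when one correspondence is wrong.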

[Figure: demo scene overview (Tiangong2Dex)]

KEY RESULTS

Gaussian Count Reduction: Substantially fewer 3D Gaussians than prior methods

vs. feed-forward baselines that don't use hierarchical voxel allocation

Fewer Gaussians mean lower bandwidth for streaming, faster rendering on operator headsets, and less latency in the teleoperation loop. This directly enables real-time deployment; prior methods that produce denser Gaussian scenes would cause noticeable lag.
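A back-of-the-envelope calculation shows why the count matters for streaming. All numbers here are illustrative assumptions rather than figures from the paper: each Gaussian is taken as 14 uncompressed 32-bit floats (position, scale, rotation quaternion, opacity, RGB), and every Gaussian is resent every frame, which makes this an upper bound.

```python
# Assumed per-Gaussian payload: position(3) + scale(3) + quaternion(4) + opacity(1) + RGB(3).
FLOATS_PER_GAUSSIAN = 3 + 3 + 4 + 1 + 3
BYTES_PER_FLOAT = 4


def stream_mbps(num_gaussians, fps):
    """Uncompressed bandwidth in megabits per second if every Gaussian is sent each frame."""
    bits_per_frame = num_gaussians * FLOATS_PER_GAUSSIAN * BYTES_PER_FLOAT * 8
    return bits_per_frame * fps / 1e6


if __name__ == "__main__":
    for count in (1_000_000, 100_000):  # hypothetical scene sizes
        print(count, "Gaussians at 30 fps:", stream_mbps(count, 30), "Mbit/s")
```

Bandwidth scales linearly with the Gaussian count, so a tenfold reduction in primitives is a tenfold reduction in raw streaming cost before any compression or incremental updates.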

360° Coverage with Real-Time Rendering: Full surround-view rendering at interactive frame rates

vs. single forward-facing cameras or cumbersome manual switching between multiple views

Operators see everything simultaneously without jitter or manual intervention. This eliminates the 'blind spot' problem and dramatically reduces cognitive load during teleoperation, which is especially critical during high-stakes tasks like emergency takeover or dexterous manipulation near obstacles.

Competitive Reconstruction Quality: Matching or exceeding prior feed-forward and view-synthesis methods

vs. prior optimization-based (slow) and feed-forward (geometry-naive) approaches

Despite using fewer Gaussians and running in real time, RobotPan doesn't sacrifice visual fidelity. This is the crux of the contribution: you get speed and accuracy together, not a compromise between them. Operators see crisp, geometrically consistent views even during dynamic robot motion.

Metric-Scaled Depth Accuracy: Ground-truth metric reconstruction via LiDAR-grounded Gaussians

vs. monocular or multi-view methods without metric anchoring

Absolute scale matters for robotics. A grasping policy needs to know exactly how far away an object is. The metric grounding means downstream manipulation and navigation modules can directly use RobotPan's output without further calibration.

PERFORMANCE COMPARISON

WHY DEVELOPERS SHOULD CARE

If you're building robot teleoperation software, RobotPan changes what's possible. Instead of writing code to switch between three camera feeds and apologizing to operators about blind spots, you now have a single, unified 360° view that streams in real time. For developers, this means you can assume complete spatial awareness in your UI and decision-making logic, with no more 'was there an obstacle behind the robot?' uncertainty. The spherical coordinate system is also a lesson: when you have strong geometric constraints (like 'everything lives on a sphere around the robot'), exploit them in your representation. Hierarchical allocation of compute is another principle to internalize: allocating fine detail near the agent and coarse detail far away is a pattern you'll see in rendering engines, simulation, and perception systems across robotics. For data collection and simulation: RobotPan releases a multi-sensor dataset for 360° novel view synthesis and metric reconstruction covering real navigation, manipulation, and locomotion tasks. This is a gift to the community, because training your own surround-view perception models becomes feasible. Finally, the paper demonstrates that feed-forward prediction (no test-time optimization) can be competitive with slower methods if you design the architecture right. This is critical for robotics, where latency kills real-time control.

LIMITATIONS

The paper doesn't explicitly detail failure modes, but implied limitations include: (1) The six-camera ring is bespoke hardware, so retrofitting it to arbitrary robots requires mechanical integration work. (2) The method assumes calibrated, overlapping camera views; calibration drift or camera damage would degrade performance. (3) The hierarchical voxel structure allocates detail based on radius from the robot center, which works well for a mobile manipulator but might not generalize to very different morphologies. (4) Online fusion with selective updates assumes relatively slow scene change; rapidly changing scenes (multiple moving people) might cause Gaussian coherence issues. (5) Real-time performance claims likely depend on specific hardware (GPU availability on the robot or a companion streaming computer); latency budgets aren't quantified in the available materials.

WHAT COMES NEXT

Plausible next steps include: (1) Pushing toward higher-resolution rendering and finer geometric detail per Gaussian while keeping the count constant, leveraging recent advances in 3D Gaussian splatting. (2) Extending to multi-robot scenarios where multiple humanoids coordinate and a single operator needs surround views of several agents simultaneously. (3) Tighter integration with manipulation and navigation policies, using RobotPan's metric reconstruction as direct input to learned control policies rather than just for visualization. (4) Reducing hardware requirements: the current six-camera + LiDAR setup is expensive, and future work might achieve similar results with fewer, cheaper sensors. (5) Handling extreme lighting conditions and transparency (glass obstacles, reflections), which currently challenge LiDAR and camera-based systems alike. The broader trajectory is toward embodied AI where perception and control are co-designed: RobotPan's success suggests that a holistic sensor/software stack designed specifically for human-in-the-loop teleoperation outperforms piecemeal approaches.

RELATED PAPERS