COMPUTER-VISION · 2026-04-15

RobotPan: A 360° Surround-View Robotic Vision System for Embodied Perception

Jiahao Ma, Qiang Zhang, Peiran Liu, Zeran Su, Pihai Sun, Gang Han, Wen Zhao, Wei Cui, Zhang Zhang, Zhiyuan Xu, Renjing Xu, Jian Tang, Miaomiao Liu, Yijie Guo

RobotPan tackles one of the most frustrating problems in robot teleoperation: giving operators a clear, immersive view of everything happening around the robot in real time. Imagine remotely controlling a humanoid through a narrow camera feed: you'd miss critical context about obstacles, nearby people, and details just outside the frame. RobotPan uses six cameras arranged in a ring around the robot's head (spaced 60° apart) plus a central LiDAR to create a seamless 360° surround view. The system predicts compact 3D Gaussians, a modern 3D representation that's lightweight and fast to render, directly from these sparse camera inputs, enabling real-time streaming to operators' displays without motion sickness, manual camera switching, or laggy jitter. It is deployed on the Tiangong 3.0 humanoid and works during complex, dynamic full-body tasks such as jumping.

ARCHITECTURE

THE PROBLEM

Before RobotPan, operators faced three problems: (1) Narrow forward-facing cameras leave critical blind spots, so you can't see what's happening to the sides of or behind the robot. (2) Multiple on-board cameras require manual switching (flipping between front/side/back views), which breaks operator flow and wastes cognitive bandwidth, especially when seconds matter. (3) Stitching multiple camera feeds together or using wide-angle lenses introduces geometric distortion and motion-induced jitter that causes simulator sickness when viewed through head-mounted displays, a serious problem when operators need to stay sharp for emergency takeover. Prior methods either focused on single-view rendering (missing the surround context), required expensive per-scene optimization (too slow for real-time teleoperation), or produced too many 3D primitives to stream and render at 30+ fps on bandwidth-constrained robotic systems.

HOW IT WORKS

Hardware: Six-Camera + LiDAR Ring Configuration

The physical foundation is simple: six RGB cameras mounted at 60° intervals around the robot's head form a ring with no blind spots, paired with a central LiDAR for depth grounding. This is a constraint-aware design: it's compact enough to fit on a humanoid head, the 60° spacing leaves enough overlap between adjacent cameras for geometric consistency, and the LiDAR provides metric-scale anchoring (critical for tasks where real distances matter). Because the overlap between adjacent cameras is sparse, RobotPan relies on feed-forward prediction (no test-time optimization) rather than slow iterative methods, which keeps latency low enough for real-time teleoperation.
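The ring geometry described above can be sanity-checked with a few lines of arithmetic. This is a minimal sketch, not the authors' calibration code: the function names are hypothetical, and the 90° horizontal field of view is an assumed value, since the source does not state the cameras' actual FOV.

```python
def ring_yaws_deg(num_cameras: int = 6) -> list[float]:
    """Yaw of each camera's optical axis, evenly spaced around the head."""
    step = 360.0 / num_cameras
    return [i * step for i in range(num_cameras)]


def adjacent_overlap_deg(num_cameras: int, hfov_deg: float) -> float:
    """Angular overlap between neighbouring cameras (negative means a gap)."""
    return hfov_deg - 360.0 / num_cameras


def covering_cameras(azimuth_deg: float, yaws: list[float], hfov_deg: float) -> list[int]:
    """Indices of the cameras whose horizontal FOV contains the given azimuth."""
    hits = []
    for index, yaw in enumerate(yaws):
        # Signed angular offset wrapped into [-180, 180).
        offset = (azimuth_deg - yaw + 180.0) % 360.0 - 180.0
        if abs(offset) <= hfov_deg / 2.0:
            hits.append(index)
    return hits


ASSUMED_HFOV_DEG = 90.0  # assumption: not given in the source
yaws = ring_yaws_deg(6)
print("overlap per camera pair:", adjacent_overlap_deg(6, ASSUMED_HFOV_DEG))
print("cameras seeing azimuth 30:", covering_cameras(30.0, yaws, ASSUMED_HFOV_DEG))
```

With any field of view wider than 60°, every azimuth is covered by at least one camera, and directions between two optical axes are seen by both neighbours.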

Spherical Coordinate Lifting: Unifying Multi-View Features

The key technical innovation is how RobotPan handles the 360° geometry. Instead of treating each camera's view separately, features from all six cameras are lifted into a unified spherical coordinate system centered on the robot. Think of it as projecting everything onto the inside of a sphere around the robot's head. This unified representation makes it natural to reason about what's nearby (fine detail needed) versus far away (coarse detail sufficient). It also promotes geometric consistency across the six camera views: conflicting geometry is much harder to produce when everything shares the same coordinate frame.
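To make the idea of a shared spherical frame concrete, here is a small geometric sketch. It only converts 3D points, whereas the paper lifts learned image features, and it assumes each camera sits at the robot-centered origin and differs only by yaw (x forward, y left, z up). All names are hypothetical.

```python
import math


def camera_to_robot(point_cam: tuple[float, float, float], yaw_deg: float) -> tuple[float, float, float]:
    """Rotate a camera-frame point about the vertical axis into the robot frame.

    Assumption: cameras share the robot-centered origin, so no translation is applied.
    """
    x, y, z = point_cam
    c, s = math.cos(math.radians(yaw_deg)), math.sin(math.radians(yaw_deg))
    return (c * x - s * y, s * x + c * y, z)


def cartesian_to_spherical(x: float, y: float, z: float) -> tuple[float, float, float]:
    """Return (radius, azimuth, elevation) with angles in radians."""
    r = math.sqrt(x * x + y * y + z * z)
    azimuth = math.atan2(y, x)
    elevation = math.asin(z / r) if r > 0.0 else 0.0
    return r, azimuth, elevation


def lift(point_cam: tuple[float, float, float], yaw_deg: float) -> tuple[float, float, float]:
    """Map a camera-frame point into robot-centered spherical coordinates."""
    return cartesian_to_spherical(*camera_to_robot(point_cam, yaw_deg))


# The same physical point, observed by camera 0 (yaw 0) and camera 1 (yaw 60).
seen_by_cam0 = (2.0 * math.cos(math.radians(30.0)), 2.0 * math.sin(math.radians(30.0)), 0.5)
seen_by_cam1 = (2.0 * math.cos(math.radians(-30.0)), 2.0 * math.sin(math.radians(-30.0)), 0.5)
print(lift(seen_by_cam0, 0.0))
print(lift(seen_by_cam1, 60.0))
```

Both observations land on the same (radius, azimuth, elevation) triple, which is the property that lets features from overlapping cameras be merged in one frame.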

Hierarchical Spherical Voxel Priors: Smart Resolution Allocation

Once features are in spherical coordinates, RobotPan decodes them into 3D Gaussians using hierarchical spherical voxel priors. The crucial word is 'hierarchical': near the robot (within arm's reach and the operator's focus area), voxels are fine-grained to capture details with high fidelity. As radius increases, voxels coarsen progressively, so the distant background gets less detail. Compared to uniform grids, this hierarchy yields what the paper calls 'substantially fewer Gaussians', making streaming and rendering fast enough for real-time operation. It's a principle you'll see across robotics: allocate computational resources where they matter most.
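One simple way to get cells that are fine near the robot and coarse far away is to space the radial shells geometrically. The sketch below illustrates that general idea only; the source does not specify the paper's actual voxel layout, and the function names and the range values (0.5 m to 16 m) are illustrative assumptions.

```python
import bisect
import math


def log_shell_edges(r_min: float, r_max: float, num_shells: int) -> list[float]:
    """Radial shell boundaries whose thickness grows geometrically with radius."""
    ratio = (r_max / r_min) ** (1.0 / num_shells)
    return [r_min * ratio ** k for k in range(num_shells + 1)]


def shell_index(r: float, edges: list[float]):
    """Index k with edges[k] <= r < edges[k + 1], or None if r is out of range."""
    if r < edges[0] or r >= edges[-1]:
        return None
    return bisect.bisect_right(edges, r) - 1


def uniform_shell_count(r_min: float, r_max: float, thickness: float) -> int:
    """Shells needed if every shell were as thin as the finest one."""
    return math.ceil((r_max - r_min) / thickness)


edges = log_shell_edges(0.5, 16.0, 5)  # assumed range, in meters
print("shell edges:", [round(e, 3) for e in edges])
print("hierarchical shells:", len(edges) - 1)
print("uniform shells at 0.5 m:", uniform_shell_count(0.5, 16.0, 0.5))
```

Here the innermost shell is about 0.5 m thick in both layouts, but the geometric layout covers the same range with far fewer radial shells, which is the kind of saving the hierarchy is after.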

Online Fusion with Selective Appearance Updates

Real-world sequences are long, and memory explodes if you naively accumulate Gaussians from every frame. RobotPan's online fusion strategy updates dynamic content (moving parts of the scene) across frames but prevents unbounded growth in static regions by refreshing only the appearance (color/texture) of static areas rather than adding new geometry. This keeps the Gaussian count stable even over dozens of seconds, which is essential for sustained teleoperation. The selective update mechanism learns what to reuse and what to refresh; similar ideas are used in other real-time systems.
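The bookkeeping behind this kind of fusion can be sketched with a dictionary keyed by voxel. This is a toy illustration under stated assumptions: in the paper the reuse/refresh decision is learned, whereas here the set of dynamic voxels is simply passed in, and the `Gaussian` type, `fuse_frame`, and the blend factor are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Gaussian:
    position: tuple[float, float, float]
    color: tuple[float, float, float]


def fuse_frame(scene: dict, frame: dict, dynamic_keys: set, blend: float = 0.5) -> dict:
    """Merge one frame of per-voxel Gaussians into the running scene.

    - Unseen or dynamic voxels take the new Gaussian as-is.
    - Static voxels keep their geometry; only the color is refreshed.
    """
    for key, new in frame.items():
        if key not in scene or key in dynamic_keys:
            scene[key] = new
        else:
            old = scene[key]
            color = tuple((1.0 - blend) * o + blend * n for o, n in zip(old.color, new.color))
            scene[key] = Gaussian(old.position, color)
    return scene


scene = {(0, 0, 0): Gaussian((1.0, 0.0, 0.0), (0.0, 0.0, 0.0))}
frame = {
    (0, 0, 0): Gaussian((1.1, 0.0, 0.0), (1.0, 1.0, 1.0)),  # static voxel, new appearance
    (0, 1, 0): Gaussian((0.0, 2.0, 0.0), (0.2, 0.2, 0.2)),  # moving object
}
fuse_frame(scene, frame, dynamic_keys={(0, 1, 0)})
print(len(scene), scene[(0, 0, 0)])
```

Because static voxels are updated in place, replaying the same region across many frames leaves the Gaussian count unchanged.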

LiDAR Grounding for Metric Reconstruction

The LiDAR isn't just decorative: it provides metric-scale supervision. Computer vision alone struggles to recover absolute scale (is that object 1 meter or 10 meters away?), but LiDAR depth is metrically accurate. By anchoring the Gaussian predictions to LiDAR depth, RobotPan produces metric-scaled 3D reconstructions that operators and downstream modules can trust for real geometric reasoning. This is non-trivial: fusing six image views with sparse LiDAR points into a consistent scene is a primary engineering challenge the paper addresses.
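The scale ambiguity can be illustrated with a standard post-hoc alignment: fit a single scale factor that maps up-to-scale depth predictions onto sparse metric LiDAR returns. This is a swapped-in technique for illustration, not the paper's method (which uses LiDAR as supervision during training). The function name and sample values are hypothetical.

```python
def least_squares_scale(pred_depths: list, lidar_depths: list) -> float:
    """Scale s minimizing sum((s * pred - lidar)^2) over pixels with a LiDAR return.

    Entries of lidar_depths may be None where no LiDAR point projects to the pixel.
    """
    pairs = [(p, l) for p, l in zip(pred_depths, lidar_depths) if l is not None and p > 0.0]
    denominator = sum(p * p for p, _ in pairs)
    if denominator == 0.0:
        raise ValueError("no valid depth pairs to align")
    return sum(p * l for p, l in pairs) / denominator


predicted = [1.0, 2.0, 4.0]   # up-to-scale depths (arbitrary units)
lidar = [2.0, None, 8.0]      # sparse metric depths in meters
scale = least_squares_scale(predicted, lidar)
metric = [scale * p for p in predicted]
print(scale, metric)
```

Only two of the three pixels have LiDAR returns here, yet one scale factor is enough to put every predicted depth into meters, which is why sparse LiDAR suffices for metric anchoring.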

[Figures: demo scene overview; Tiangong2Dex]

KEY RESULTS

Gaussian Count Reduction: Substantially fewer 3D Gaussians than prior methods

vs. feed-forward baselines that don't use hierarchical voxel allocation

Fewer Gaussians mean lower bandwidth for streaming, faster rendering on operator headsets, and less latency in the loop. This directly enables real-time teleoperation; prior methods that produce denser Gaussian scenes would cause noticeable lag.
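A back-of-the-envelope calculation shows why the Gaussian count drives bandwidth. The sketch assumes a common minimal 3D Gaussian layout (position, scale, rotation quaternion, opacity, RGB) stored as 16-bit floats and resent in full every frame without compression. The counts are made-up inputs, not figures from the paper.

```python
# Assumed per-Gaussian layout: position(3) + scale(3) + quaternion(4) + opacity(1) + RGB(3)
FLOATS_PER_GAUSSIAN = 3 + 3 + 4 + 1 + 3


def stream_mbit_per_s(num_gaussians: int, fps: float, bytes_per_float: int = 2) -> float:
    """Uncompressed bandwidth if the full Gaussian set is sent every frame."""
    bytes_per_frame = num_gaussians * FLOATS_PER_GAUSSIAN * bytes_per_float
    return bytes_per_frame * fps * 8 / 1e6


for count in (100_000, 25_000):  # hypothetical scene sizes
    print(count, "Gaussians ->", stream_mbit_per_s(count, fps=30), "Mbit/s")
```

Under these assumptions bandwidth scales linearly with the Gaussian count, so cutting the count by a factor of four cuts the stream by the same factor.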

360° Coverage with Real-Time Rendering: Full surround-view rendering at interactive frame rates

vs. single forward-facing cameras or cumbersome manual switching between multiple views

Operators see everything simultaneously without jitter or manual camera switching. This eliminates the 'blind spot' problem and dramatically reduces cognitive load during teleoperation, which is especially critical in high-stakes situations such as emergency takeover or operation near obstacles.

Competitive Reconstruction Quality: Matching or exceeding prior feed-forward and view-synthesis methods

vs. prior optimization-based (slow) and feed-forward (geometry-naive) approaches

Despite using fewer Gaussians and running in real time, RobotPan doesn't sacrifice visual fidelity. This is the crux of the contribution: you get fast AND accurate, not a compromise between them. Operators see crisp, geometrically consistent views even during dynamic motion.

Metric-Scaled Depth Accuracy: Ground-truth metric reconstruction via LiDAR-grounded Gaussians

vs. monocular or multi-view methods without metric anchoring

Absolute scale matters for robotics. A robot needs to know exactly how far away an object is. The LiDAR grounding means downstream modules can directly use RobotPan's output without further calibration.

WHY DEVELOPERS SHOULD CARE

If you're building teleoperation software, RobotPan changes what's possible. Instead of writing code to switch between three camera feeds and apologizing to operators about blind spots, you have a single, unified 360° view that streams in real time. For developers, this means you can assume complete spatial awareness in your UI and decision-making logic, with no more 'was there an obstacle behind the robot?' uncertainty. The spherical coordinate system is also a lesson: when you have strong geometric constraints (like 'everything lives on a sphere around the robot'), exploit them in your representation. Hierarchical allocation of compute is another principle to internalize: fine detail near the agent and coarse detail far away is a pattern you'll see in rendering engines and in systems across robotics. For data collection and training: RobotPan releases a multi-sensor dataset for 360° novel view synthesis and reconstruction covering several kinds of real robot tasks. This is a gift to the community, because training your own surround-view models becomes feasible. Finally, the paper demonstrates that feed-forward prediction (no test-time optimization) can be competitive with slower methods if you design the architecture right. This is critical for robotics, where high latency makes real-time operation impractical.

LIMITATIONS

The paper doesn't explicitly detail failure modes, but implied limitations include: (1) The six-camera ring is bespoke hardware, so retrofitting it to arbitrary robots requires mechanical integration work. (2) The method assumes calibrated, overlapping camera views; calibration drift or camera damage would degrade performance. (3) The hierarchical voxel structure allocates detail based on radius from the robot center, which works well for a mobile manipulator but might not generalize to very different morphologies. (4) Online fusion with selective updates assumes relatively slow scene change; rapidly changing scenes (multiple moving people) might cause Gaussian coherence issues. (5) Real-time performance claims likely depend on specific hardware (GPU availability on the robot or a companion streaming computer); latency budgets aren't quantified in the available materials.

WHAT COMES NEXT

Future versions will likely tackle: (1) Pushing toward higher-resolution rendering and finer geometric detail per Gaussian while keeping the count constant, leveraging recent advances in 3D Gaussian splatting. (2) Extending to multi-robot scenarios where multiple humanoids coordinate and a single operator needs surround views of several agents simultaneously. (3) Tighter integration with downstream policies, using RobotPan's reconstruction as direct input to learned policies rather than just for visualization. (4) Reducing hardware requirements: the current six-camera + LiDAR setup is expensive; future work might achieve similar results with fewer, cheaper sensors. (5) Handling extreme lighting conditions and transparency (glass obstacles, reflections), which currently challenge LiDAR and camera-based systems alike. The broader trend is toward systems where hardware and software are co-designed: RobotPan's success suggests that a holistic hardware/software stack built for a specific purpose outperforms piecemeal approaches.
