Ego to World: Collaborative Spatial Reasoning
in Embodied Systems via Reinforcement Learning

Heng Zhou1,2,*, Li Kang2,3,*, Yiran Qin2,‡, Xiufeng Song2,3, Ao Yu1,
Zilu Zhang4, Haoming Song2,3, Kaixin Xu2,5, Yuchen Fan2,3, Dongzhan Zhou2,
Xiaohong Liu3, Ruimao Zhang6, Philip Torr7, Lei Bai2,†, Zhenfei Yin7,†
1University of Science and Technology of China  2Shanghai AI Laboratory  3Shanghai Jiao Tong University
4Beijing University of Posts and Telecommunications  5Fudan University  6Sun Yat-sen University  7University of Oxford
* Equal contributions  ‡ Project leader  † Corresponding authors

Figure 1. Single-view reasoning fails under occlusion. Cross-view reasoning leverages overlapping perspectives to correctly locate and grasp target objects across multiple robot viewpoints.


Abstract

Understanding the world from distributed, partial viewpoints is a fundamental challenge for embodied multi-agent systems. Each agent perceives the environment through an ego-centric view that is often limited by occlusion and ambiguity. To study this problem, we introduce the Ego-to-World (E2W) benchmark, which evaluates vision-language models' ability to fuse heterogeneous viewpoints across three tasks: (i) global counting, (ii) relational location reasoning, and (iii) action-oriented grasping that requires predicting view-specific image coordinates. To address this challenge, we propose CoRL, a two-stage framework that combines Chain-of-Thought supervised fine-tuning with reinforcement learning using Group-Relative Policy Optimization. Its core component, the Cross-View Spatial Reward (CVSR), provides dense, task-aligned feedback by linking reasoning steps to visual evidence, ensuring coherent cross-view entity resolution, and guiding the model toward correct final predictions. Experiments on E2W show that CoRL consistently surpasses strong proprietary and open-source baselines on both reasoning and perception-grounding metrics. Beyond simulation, CoRL generalizes to external spatial reasoning benchmarks and enables effective real-world multi-robot manipulation.
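For readers unfamiliar with GRPO, the group-relative advantage it substitutes for a learned critic can be sketched in a few lines. The group size and epsilon below are illustrative choices, not values from the paper.

import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """rewards: shape (G,), one scalar reward per sampled response to a prompt.

    Each response's reward is normalized against its own group's mean and
    standard deviation, so no value function (critic) is needed.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: rewards for a group of G = 4 sampled responses to one prompt.
adv = group_relative_advantages(np.array([0.9, 0.3, 0.5, 0.1]))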


Method

Figure 2. CoRL Framework. Stage 1: SFT with Chain-of-Thought supervision on sim & real data. Stage 2: Reinforcement fine-tuning with Cross-View Spatial Reward (CVSR) — comprising grounding reward (bbox IoU), overlap reward (cross-view entity matching), and answer reward (task-specific accuracy).
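As a concrete illustration of Figure 2's reward decomposition, here is a minimal Python sketch of a CVSR-style composite reward. The weights, the IoU aggregation over cited boxes, and the exact-match answer check are illustrative assumptions; the paper's precise formulation may differ.

def iou(a, b):
    """IoU of two boxes in [x1, y1, x2, y2] format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def cvsr_reward(pred_boxes, gt_boxes, pred_matches, gt_matches,
                pred_answer, gt_answer, w=(0.3, 0.3, 0.4)):
    """pred/gt_boxes: lists of boxes aligned to the same entities.
    pred/gt_matches: cross-view entity pairs, e.g. (idx_in_A, idx_in_B).
    w: hypothetical weights for (grounding, overlap, answer)."""
    # Grounding reward: mean IoU between cited boxes and annotations.
    r_ground = sum(iou(p, g) for p, g in zip(pred_boxes, gt_boxes)) / max(len(gt_boxes), 1)
    # Overlap reward: fraction of cross-view entity matches recovered.
    r_overlap = len(set(pred_matches) & set(gt_matches)) / max(len(gt_matches), 1)
    # Answer reward: task-specific accuracy of the final prediction.
    r_answer = float(pred_answer == gt_answer)
    return w[0] * r_ground + w[1] * r_overlap + w[2] * r_answer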


E2W Benchmark

Figure 3. E2W-Bench overview. Three-level tasks with data collected from simulation environments (diverse 3D assets) and real-world robot setups (annotated via RAM + Grounding DINO).
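For concreteness, one plausible way to represent a single two-view E2W item is sketched below; all field names are hypothetical and do not reflect the released data format.

from dataclasses import dataclass, field

@dataclass
class E2WSample:
    task: str                      # "counting" | "reasoning" | "grasping"
    images: list[str]              # [view_A_path, view_B_path]
    question: str                  # natural-language query
    boxes: dict[str, list[list[float]]] = field(default_factory=dict)
                                   # per-view [x1, y1, x2, y2] annotations
    matches: list[tuple[int, int]] = field(default_factory=list)
                                   # cross-view entity correspondences
    answer: str = ""               # e.g. "2", "kettle", "0, [251.5, 92.0]"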


Qualitative Examples

Outputs from Qwen2.5-VL-3B + CoRL on the E2W-Bench test set, showing chain-of-thought reasoning with cross-view spatial analysis.

E2W-1: Object Counting
Images: View A (Main) · View B (Auxiliary)
Can you assess how many apples are present?
I can see an apple at [195, 74, 207, 85] in View A ... an apple at [111, 129, 124, 139] in View B ... The apple in the auxiliary perspective is not visible in the main perspective ... Combining observations from both perspectives, the total is 2.
\boxed{2} Correct
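The counting logic the model verbalizes above reduces to a simple identity: total detections across views minus entities matched in both. A minimal sketch with illustrative inputs:

def cross_view_count(dets_a: int, dets_b: int, matched_pairs: int) -> int:
    # Each cross-view match means two detections refer to one entity.
    return dets_a + dets_b - matched_pairs

# Example above: one apple per view, no cross-view match -> 2.
assert cross_view_count(1, 1, 0) == 2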
E2W-2: Spatial Reasoning
Images: View A (Main) · View B (Auxiliary)
Which object is the third from left to right on the table?
In View B: cardboardbox at [4,93,49,107], pizza at [136,66,159,87], kettle at [204,83,226,103], hammer at [328,94,360,106]. Spatial order left→right: cardboardbox → pizza → kettle → hammer. The third is kettle.
\boxed{kettle} Correct
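The ordering step in this example is reproducible directly from the cited boxes: sorting by horizontal box center yields the left-to-right order. A quick check:

boxes = {"cardboardbox": [4, 93, 49, 107], "pizza": [136, 66, 159, 87],
         "kettle": [204, 83, 226, 103], "hammer": [328, 94, 360, 106]}
# Sort object names by the x-midpoint of their bounding boxes.
order = sorted(boxes, key=lambda k: (boxes[k][0] + boxes[k][2]) / 2)
assert order[2] == "kettle"   # third from the left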
E2W-3: Grasping
Images: View A (Main) · View B (Auxiliary)
Please grasp the apple that is third from front to back on the table.
The target apple is visible in main perspective at [248, 89, 255, 95]. Center coordinates: [251.5, 92.0]. I should grasp at image 0 coordinates [251.5, 92.0].
\boxed{0, [251.5, 92.0]} Correct
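The grasp point above is simply the midpoint of the cited box in the chosen view's image frame:

x1, y1, x2, y2 = 248, 89, 255, 95
grasp = ((x1 + x2) / 2, (y1 + y2) / 2)   # -> (251.5, 92.0) in View A (image 0)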

Results on E2W-Bench

Model                  E2W-1  E2W-2(S)  E2W-2(R)  Avg. Reasoning  E2W-3(S)  Avg. Perception
Qwen2.5-VL-3B + CoRL    60.8      92.0      84.0            78.9      97.7             97.7
Qwen2.5-VL-7B + CoRL    66.8      92.8      89.2            82.9      97.8             97.8

E2W-1: Object Counting (exact match %) • E2W-2: Spatial Reasoning, (S)im / (R)eal (exact match %) • E2W-3: Grasping, (S)im (proximity score)


BibTeX

@misc{zhou2026egoworldcollaborativespatial,
    title={Ego to World: Collaborative Spatial Reasoning in Embodied Systems via Reinforcement Learning}, 
    author={Heng Zhou and Li Kang and Yiran Qin and Xiufeng Song and Ao Yu and Zilu Zhang and Haoming Song and Kaixin Xu and Yuchen Fan and Dongzhan Zhou and Xiaohong Liu and Ruimao Zhang and Philip Torr and Lei Bai and Zhenfei Yin},
    year={2026},
    eprint={2603.14811},
    archivePrefix={arXiv},
    primaryClass={cs.RO},
    url={https://arxiv.org/abs/2603.14811}, 
}