Ego to World: Collaborative Spatial Reasoning
in Embodied Systems via Reinforcement Learning

Heng Zhou1,2,*, Li Kang2,3,*, Yiran Qin2,‡, Xiufeng Song2,3, Ao Yu1,
Zilu Zhang4, Haoming Song2,3, Kaixin Xu2,5, Yuchen Fan2,3, Dongzhan Zhou2,
Xiaohong Liu3, Ruimao Zhang6, Philip Torr7, Lei Bai2,†, Zhenfei Yin7,†
1University of Science and Technology of China  2Shanghai AI Laboratory  3Shanghai Jiao Tong University
4Beijing University of Posts and Telecommunications  5Fudan University  6Sun Yat-sen University  7University of Oxford
* Equal contributions  ‡ Project leader  † Corresponding authors

Figure 1. Single-view reasoning fails under occlusion. Cross-view reasoning leverages overlapping perspectives to correctly locate and grasp target objects across multiple robot viewpoints.


Abstract

Understanding the world from distributed, partial viewpoints is a fundamental challenge for embodied multi-agent systems. Each agent perceives the environment through an ego-centric view that is often limited by occlusion and ambiguity. To study this problem, we introduce the Ego-to-World (E2W) benchmark, which evaluates vision-language models' ability to fuse heterogeneous viewpoints across three tasks: (i) global counting, (ii) relational location reasoning, and (iii) action-oriented grasping that requires predicting view-specific image coordinates. To address this challenge, we propose CoRL, a two-stage framework that combines Chain-of-Thought supervised fine-tuning with reinforcement learning using Group-Relative Policy Optimization. Its core component, the Cross-View Spatial Reward (CVSR), provides dense, task-aligned feedback by linking reasoning steps to visual evidence, ensuring coherent cross-view entity resolution, and guiding the model toward correct final predictions. Experiments on E2W show that CoRL consistently surpasses strong proprietary and open-source baselines on both reasoning and perception-grounding metrics. Beyond simulation, CoRL generalizes to external spatial reasoning benchmarks and enables effective real-world multi-robot manipulation.
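For readers unfamiliar with GRPO, the group-relative advantage it substitutes for a learned critic can be sketched in a few lines. The group size and epsilon below are illustrative choices, not values from the paper.

import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """rewards: shape (G,), one scalar reward per sampled response to a prompt.

    Each response's reward is normalized against its own group's mean and
    standard deviation, so no value function (critic) is needed.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: rewards for a group of G = 4 sampled responses to one prompt.
adv = group_relative_advantages(np.array([0.9, 0.3, 0.5, 0.1]))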


Method

Figure 2. CoRL Framework. Stage 1: SFT with Chain-of-Thought supervision on sim & real data. Stage 2: Reinforcement fine-tuning with Cross-View Spatial Reward (CVSR) — comprising grounding reward (bbox IoU), overlap reward (cross-view entity matching), and answer reward (task-specific accuracy).
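As a concrete illustration of Figure 2's reward decomposition, here is a minimal Python sketch of a CVSR-style composite reward. The weights, the IoU aggregation over cited boxes, and the exact-match answer check are illustrative assumptions; the paper's precise formulation may differ.

def iou(a, b):
    """IoU of two boxes in [x1, y1, x2, y2] format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def cvsr_reward(pred_boxes, gt_boxes, pred_matches, gt_matches,
                pred_answer, gt_answer, w=(0.3, 0.3, 0.4)):
    """pred/gt_boxes: lists of boxes aligned to the same entities.
    pred/gt_matches: cross-view entity pairs, e.g. (idx_in_A, idx_in_B).
    w: hypothetical weights for (grounding, overlap, answer)."""
    # Grounding reward: mean IoU between cited boxes and annotations.
    r_ground = sum(iou(p, g) for p, g in zip(pred_boxes, gt_boxes)) / max(len(gt_boxes), 1)
    # Overlap reward: fraction of cross-view entity matches recovered.
    r_overlap = len(set(pred_matches) & set(gt_matches)) / max(len(gt_matches), 1)
    # Answer reward: task-specific accuracy of the final prediction.
    r_answer = float(pred_answer == gt_answer)
    return w[0] * r_ground + w[1] * r_overlap + w[2] * r_answer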


E2W Benchmark

Figure 3. E2W-Bench overview. Three-level tasks with data collected from simulation environments (diverse 3D assets) and real-world robot setups (annotated via RAM + Grounding DINO).
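For concreteness, one plausible way to represent a single two-view E2W item is sketched below; all field names are hypothetical and do not reflect the released data format.

from dataclasses import dataclass, field

@dataclass
class E2WSample:
    task: str                      # "counting" | "reasoning" | "grasping"
    images: list[str]              # [view_A_path, view_B_path]
    question: str                  # natural-language query
    boxes: dict[str, list[list[float]]] = field(default_factory=dict)
                                   # per-view [x1, y1, x2, y2] annotations
    matches: list[tuple[int, int]] = field(default_factory=list)
                                   # cross-view entity correspondences
    answer: str = ""               # e.g. "2", "kettle", "0, [251.5, 92.0]"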


Qualitative Examples

Outputs from Qwen2.5-VL-3B + CoRL on the E2W-Bench test set, showing chain-of-thought reasoning with cross-view spatial analysis.

E2W-1: Object Counting
Images: View A (Main) · View B (Auxiliary)
Can you assess how many apples are present?
I can see an apple at [195, 74, 207, 85] in View A ... an apple at [111, 129, 124, 139] in View B ... The apple in the auxiliary perspective is not visible in the main perspective ... Combining observations from both perspectives, the total is 2.
\boxed{2} Correct
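The counting logic the model verbalizes above reduces to a simple identity: total detections across views minus entities matched in both. A minimal sketch with illustrative inputs:

def cross_view_count(dets_a: int, dets_b: int, matched_pairs: int) -> int:
    # Each cross-view match means two detections refer to one entity.
    return dets_a + dets_b - matched_pairs

# Example above: one apple per view, no cross-view match -> 2.
assert cross_view_count(1, 1, 0) == 2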
E2W-2: Spatial Reasoning
Images: View A (Main) · View B (Auxiliary)
Which object is the third from left to right on the table?
In View B: cardboardbox at [4,93,49,107], pizza at [136,66,159,87], kettle at [204,83,226,103], hammer at [328,94,360,106]. Spatial order left→right: cardboardbox → pizza → kettle → hammer. The third is kettle.
\boxed{kettle} Correct
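The ordering step in this example is reproducible directly from the cited boxes: sorting by horizontal box center yields the left-to-right order. A quick check:

boxes = {"cardboardbox": [4, 93, 49, 107], "pizza": [136, 66, 159, 87],
         "kettle": [204, 83, 226, 103], "hammer": [328, 94, 360, 106]}
# Sort object names by the x-midpoint of their bounding boxes.
order = sorted(boxes, key=lambda k: (boxes[k][0] + boxes[k][2]) / 2)
assert order[2] == "kettle"   # third from the left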
E2W-3: Grasping
Images: View A (Main) · View B (Auxiliary)
Please grasp the apple that is third from front to back on the table.
The target apple is visible in main perspective at [248, 89, 255, 95]. Center coordinates: [251.5, 92.0]. I should grasp at image 0 coordinates [251.5, 92.0].
\boxed{0, [251.5, 92.0]} Correct
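The grasp point above is simply the midpoint of the cited box in the chosen view's image frame:

x1, y1, x2, y2 = 248, 89, 255, 95
grasp = ((x1 + x2) / 2, (y1 + y2) / 2)   # -> (251.5, 92.0) in View A (image 0)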

Results on E2W-Bench

Model                  E2W-1  E2W-2(S)  E2W-2(R)  Avg. Reasoning  E2W-3(S)  Avg. Perception
Qwen2.5-VL-3B + CoRL    60.8      92.0      84.0            78.9      97.7             97.7
Qwen2.5-VL-7B + CoRL    66.8      92.8      89.2            82.9      97.8             97.8

E2W-1: Object Counting (exact match %) • E2W-2: Spatial Reasoning, (S)im / (R)eal (exact match %) • E2W-3: Grasping, (S)im (proximity score)


BibTeX

@misc{zhou2026egoworldcollaborativespatial,
    title={Ego to World: Collaborative Spatial Reasoning in Embodied Systems via Reinforcement Learning}, 
    author={Heng Zhou and Li Kang and Yiran Qin and Xiufeng Song and Ao Yu and Zilu Zhang and Haoming Song and Kaixin Xu and Yuchen Fan and Dongzhan Zhou and Xiaohong Liu and Ruimao Zhang and Philip Torr and Lei Bai and Zhenfei Yin},
    year={2026},
    eprint={2603.14811},
    archivePrefix={arXiv},
    primaryClass={cs.RO},
    url={https://arxiv.org/abs/2603.14811}, 
}