A Comprehensive Benchmark for Embodied Reasoning and Geo-localization in Multimodal Large Language Models
ERGeoBench evaluates whether multimodal large language models can actively explore street-view environments and progressively infer their geographic location through observation, evidence verification, hypothesis updating, and action planning.
Existing geo-localization benchmarks usually evaluate models using a single image or a full panoramic observation. ERGeoBench studies a more active setting: an embodied agent can rotate, pitch, and zoom to gather evidence over multiple steps. The benchmark is designed to diagnose vision-driven embodied geo-localization in multimodal large language models.
Single-view: The model receives one fixed perspective view and predicts the location from limited evidence.
Panorama: The model receives the full panorama, providing broad visual coverage without sequential exploration.
Embodied: The model actively controls yaw, pitch, and zoom to collect evidence and refine its hypothesis.
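As a rough illustration of the active setting, the camera can be modeled as a small state that each action updates. The sketch below is not the benchmark's implementation: the `ViewState` class, the `apply_action` helper, the relative-delta action format, and the pitch and zoom limits are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical limits; the benchmark's actual action ranges are not stated here.
PITCH_RANGE = (-90.0, 90.0)
ZOOM_RANGE = (1.0, 4.0)


@dataclass(frozen=True)
class ViewState:
    yaw: float = 0.0    # heading in degrees, wraps around at 360
    pitch: float = 0.0  # elevation in degrees, clamped to PITCH_RANGE
    zoom: float = 1.0   # magnification factor, clamped to ZOOM_RANGE


def clamp(value, low, high):
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def apply_action(state, d_yaw=0.0, d_pitch=0.0, d_zoom=0.0):
    """Return the view reached by applying relative deltas to `state`."""
    return ViewState(
        yaw=(state.yaw + d_yaw) % 360.0,
        pitch=clamp(state.pitch + d_pitch, *PITCH_RANGE),
        zoom=clamp(state.zoom + d_zoom, *ZOOM_RANGE),
    )


# A three-step exploration: turn, turn while looking up, then zoom in.
state = ViewState()
for action in [{"d_yaw": 270.0}, {"d_yaw": 120.0, "d_pitch": 100.0}, {"d_zoom": 5.0}]:
    state = apply_action(state, **action)
print(state)
```

Yaw wraps because a panorama is continuous horizontally, while pitch and zoom are clamped because they have physical limits. In a full agent loop, the model would inspect the rendered view after each step and choose the next action from its current location hypothesis.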
To support reproducibility while respecting data redistribution constraints, we will release panorama identifiers and metadata rather than redistributing raw Street View imagery. Users should retrieve corresponding visual observations through official data access interfaces and comply with applicable terms of use.
sample_id,pano_id,country,city,split
000001,PLACEHOLDER_PANO_ID,Japan,Tokyo,test
000002,PLACEHOLDER_PANO_ID,France,Paris,test
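A metadata file in the format above can be read with Python's standard `csv` module. This is a minimal sketch: the `load_metadata` helper is hypothetical, and the released file may contain more columns than the sample shows. Values are kept as strings so that the leading zeros in `sample_id` survive.

```python
import csv
import io

# Inline copy of the sample rows above; a released file would be opened from disk.
SAMPLE = """sample_id,pano_id,country,city,split
000001,PLACEHOLDER_PANO_ID,Japan,Tokyo,test
000002,PLACEHOLDER_PANO_ID,France,Paris,test
"""


def load_metadata(handle, split=None):
    """Read metadata rows as dicts of strings, optionally keeping one split."""
    rows = list(csv.DictReader(handle))
    if split is not None:
        rows = [row for row in rows if row["split"] == split]
    return rows


rows = load_metadata(io.StringIO(SAMPLE), split="test")
for row in rows:
    print(row["sample_id"], row["country"], row["city"])
```

Each `pano_id` would then be passed to an official data access interface to retrieve the corresponding imagery, subject to that interface's terms of use.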
The paper link will be updated after the official proceedings or preprint becomes available.
@inproceedings{xue2026ergeobench,
title = {ERGeoBench: A Comprehensive Benchmark for Embodied Reasoning and Geo-localization in Multimodal Large Language Models},
author = {Xue, Kaiwen and Wei, Tao and Ou, Zhonghong and Zhang, Guoxin and Lu, Kaoyan and Feng, Yu and Zhu, Yifan and Luo, Haoran},
booktitle = {Proceedings of the International Conference on Machine Learning},
year = {2026}
}