ICML 2026

ERGeoBench

A Comprehensive Benchmark for Embodied Reasoning and Geo-localization in Multimodal Large Language Models

ERGeoBench evaluates whether multimodal large language models can actively explore street-view environments and progressively infer their geographic location through observation, evidence verification, hypothesis updating, and action planning.


Code, prompt templates, evaluation scripts, annotations, and Street View panorama identifiers will be released soon.

Overview

Existing geo-localization benchmarks usually evaluate models on a single image or a full panoramic observation. ERGeoBench studies a more active setting: an embodied agent adjusts yaw, pitch, and zoom to gather evidence over multiple steps. The benchmark is designed to diagnose vision-driven embodied geo-localization in multimodal large language models.

Evaluation Settings

Single-view

The model receives one fixed perspective view and predicts the location from limited evidence.

Panorama-view

The model receives the full panorama, providing broad visual coverage without sequential exploration.

Embodied-view

The model actively controls yaw, pitch, and zoom to collect evidence and refine its hypothesis.
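
As an illustration of the yaw, pitch, and zoom controls described above, the following is a minimal sketch of a camera state and an action update. The names `ViewState` and `apply_action`, the relative-delta action format, and the clamping ranges are assumptions made for this example; they are not taken from the ERGeoBench code, which has not been released yet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ViewState:
    """Hypothetical camera state for one exploration step."""
    yaw: float = 0.0    # heading in degrees, wrapped to [0, 360)
    pitch: float = 0.0  # elevation in degrees, clamped to [-90, 90]
    zoom: float = 1.0   # magnification, clamped to [1, 4] (assumed range)


def _clamp(value: float, low: float, high: float) -> float:
    return max(low, min(high, value))


def apply_action(state: ViewState, d_yaw: float = 0.0,
                 d_pitch: float = 0.0, d_zoom: float = 0.0) -> ViewState:
    """Return the view reached after applying relative yaw/pitch/zoom deltas."""
    return ViewState(
        yaw=(state.yaw + d_yaw) % 360.0,                    # wrap around the horizon
        pitch=_clamp(state.pitch + d_pitch, -90.0, 90.0),   # cannot look past zenith/nadir
        zoom=_clamp(state.zoom + d_zoom, 1.0, 4.0),         # assumed zoom limits
    )


# Example: turn 30 degrees to the left, then zoom in on a sign.
view = apply_action(ViewState(), d_yaw=-30.0)
view = apply_action(view, d_zoom=1.0)
```

Keeping the state immutable makes it easy to log the full sequence of views an agent visited, which is useful when inspecting how a hypothesis changed across steps.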

Capability Dimensions

Foundational Perception: recognition of architecture, signage, vegetation, terrain, and infrastructure.
Spatial Awareness: viewpoint, direction, depth, relative position, and cross-view consistency.
Commonsense Reasoning: geographic, environmental, cultural, and functional inference.
Geo-localization: country, city, street or area, and GPS prediction.
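
GPS predictions are commonly scored by the great-circle distance between the predicted and ground-truth coordinates. The sketch below uses the haversine formula for that purpose. This page does not state which metric ERGeoBench uses, so the function name `haversine_km` and the choice of metric are assumptions for illustration only.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Example: error of a prediction in central Paris against a ground truth in Tokyo.
error_km = haversine_km(35.6762, 139.6503, 48.8566, 2.3522)
```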

Data Release

To support reproducibility while respecting data redistribution constraints, we will release panorama identifiers and metadata rather than redistributing raw Street View imagery. Users should retrieve corresponding visual observations through official data access interfaces and comply with applicable terms of use.

sample_id,pano_id,country,city,split
000001,PLACEHOLDER_PANO_ID,Japan,Tokyo,test
000002,PLACEHOLDER_PANO_ID,France,Paris,test
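
A file in the format above could be read as in the following minimal sketch, which uses only the Python standard library. The helper name `load_metadata` is hypothetical, and the embedded text simply repeats the placeholder rows shown above; the final release may use a different schema.

```python
import csv
import io

# Placeholder rows copied from the sample above; real pano_id values are not yet released.
SAMPLE = """sample_id,pano_id,country,city,split
000001,PLACEHOLDER_PANO_ID,Japan,Tokyo,test
000002,PLACEHOLDER_PANO_ID,France,Paris,test
"""


def load_metadata(handle):
    """Read metadata rows as dictionaries keyed by the CSV header."""
    # DictReader keeps every field as a string, so leading zeros in sample_id survive.
    return list(csv.DictReader(handle))


rows = load_metadata(io.StringIO(SAMPLE))
test_rows = [row for row in rows if row["split"] == "test"]
```

For a file on disk, the same function can be called with `open(path, newline="", encoding="utf-8")` in place of the `io.StringIO` wrapper.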

Release Plan

Evaluation scripts for Single-view, Panorama-view, and Embodied-view settings
Prompt templates for embodied geo-localization
Benchmark metadata and annotation files
Street View panorama identifiers
Model prediction files and evaluation utilities

Paper

The paper link will be added once the official proceedings or a preprint becomes available.

Citation

@inproceedings{xue2026ergeobench,
  title     = {ERGeoBench: A Comprehensive Benchmark for Embodied Reasoning and Geo-localization in Multimodal Large Language Models},
  author    = {Xue, Kaiwen and Wei, Tao and Ou, Zhonghong and Zhang, Guoxin and Lu, Kaoyan and Feng, Yu and Zhu, Yifan and Luo, Haoran},
  booktitle = {Proceedings of the International Conference on Machine Learning},
  year      = {2026}
}