As artificial intelligence (AI) advances rapidly, especially in multimodal large language models (MLLMs), research focus is shifting from single-modality text processing to the more complex domains of multimodal and embodied AI. Embodied intelligence focuses on training agents within realistic simulated environments, leveraging physical interaction and action feedback rather than conventionally labeled datasets. Yet most existing simulation platforms remain narrowly designed, each tailored to specific tasks. A versatile, general-purpose training environment that supports everything from low-level embodied navigation to high-level composite activities, such as multi-agent social simulation and human-AI collaboration, remains largely unavailable. To bridge this gap, we introduce TongSIM, a high-fidelity, general-purpose platform for training and evaluating embodied agents. TongSIM provides over 100 diverse, multi-room indoor scenarios as well as an open-ended, interaction-rich outdoor town simulation, covering a broad range of research needs. Its evaluation framework and benchmarks assess agent capabilities such as perception, cognition, decision-making, human-robot cooperation, and spatial and social reasoning. With customized scenes, task-adaptive fidelity, diverse agent types, and dynamic environmental simulation, TongSIM gives researchers a flexible, scalable, unified platform that accelerates training, evaluation, and progress toward general embodied intelligence.
Figure 1: Overview of the TongSIM System Architecture. The platform consists of a UE5-based simulator and a Python controller. It supports multimodal data sensors, high-fidelity simulation, large-scale NPC systems, and parallel training, integrated with a robust evaluation system.
| Category | Feature | TongSIM (Ours) | GRUtopia | OmniGibson | Habitat | VirtualHome | Virtual Community |
|---|---|---|---|---|---|---|---|
| Core | Engine Base | UE 5 | Isaac Sim | Isaac Sim | Custom (C++) | Unity3D | Genesis |
| Environment & Scenes | Scene Categories | 115 | ~100 (Annotated) | 51 (8 Types) | 211 | 6 Homes | 35 Urban Areas |
| Environment & Scenes | Indoor Scope | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
| Environment & Scenes | Outdoor Scope | ✔ | ✔ | ✔ | ✘ | ✘ | ✔ |
| Environment & Scenes | City-level Interaction | ✔ | ✔ | ✘ | ✘ | ✘ | ✔ |
| Platform Features | Parallel Training | ✔ | ✔ | ✔ | ✔ | ✘ | ✘ |
| Platform Features | Task-oriented Fidelity | ✔ | ✘ | ✘ | ✘ | ✘ | ✘ |
| Platform Features | NPC Control | ✔ | ✔ | ✘ | ✔ | ✔ | ✔ |
| Platform Features | Sim-to-Real Support | ✔ | ✔ | ✔ | ✔ | ✘ | ✔ |
| Supported Tasks | Single-Agent | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
| Supported Tasks | Multi-Agent | ✔ | ✔ | ✘ | ✔ | ✔ | ✔ |
| Supported Tasks | Human-Robot Teaming | ✔ | ✘ | ✘ | ✔ | ✘ | ✔ |
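
For readers who want to filter platforms by capability, the ✔/✘ rows of the comparison table above can be transcribed into a small lookup. The following is a minimal Python sketch; the names `PLATFORMS`, `FEATURES`, and `platforms_supporting` are illustrative only and are not part of TongSIM or any of the listed simulators.

```python
# Column order follows the comparison table.
PLATFORMS = ["TongSIM", "GRUtopia", "OmniGibson", "Habitat",
             "VirtualHome", "Virtual Community"]

Y, N = True, False  # Y = supported (✔), N = not supported (✘)

# Boolean rows transcribed from the table; names here are illustrative.
FEATURES = {
    "Indoor Scope":           (Y, Y, Y, Y, Y, Y),
    "Outdoor Scope":          (Y, Y, Y, N, N, Y),
    "City-level Interaction": (Y, Y, N, N, N, Y),
    "Parallel Training":      (Y, Y, Y, Y, N, N),
    "Task-oriented Fidelity": (Y, N, N, N, N, N),
    "NPC Control":            (Y, Y, N, Y, Y, Y),
    "Sim-to-Real Support":    (Y, Y, Y, Y, N, Y),
    "Single-Agent":           (Y, Y, Y, Y, Y, Y),
    "Multi-Agent":            (Y, Y, N, Y, Y, Y),
    "Human-Robot Teaming":    (Y, N, N, Y, N, Y),
}


def platforms_supporting(required):
    """Return the platforms that support every feature in `required`."""
    return [
        name
        for i, name in enumerate(PLATFORMS)
        if all(FEATURES[feature][i] for feature in required)
    ]


if __name__ == "__main__":
    print(platforms_supporting(["Outdoor Scope", "Parallel Training"]))
    print(platforms_supporting(list(FEATURES)))
```

Passing every feature name at once narrows the result to platforms with a ✔ in every boolean row of the table.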
A single-agent benchmark requiring autonomous navigation and obstacle avoidance to clean up scattered debris in complex multi-room indoor environments.
A multi-agent collaboration benchmark in which agents work together to collect supplies while dodging dynamic hazards in a partially observable post-flood environment.
A social navigation benchmark testing a robot's ability to move safely and socially compliantly through dense, dynamic human crowds in urban settings.
A comprehensive evaluation of MLLM-driven agents on diverse household activities spanning object understanding, spatial reasoning, and social interaction.
A benchmark for evaluating embodied social intelligence. Agents in 3D environments must engage in proactive dialogue with NPCs, who have complex preferences and social relationships, while exploring autonomously to complete seating arrangement tasks under multi-objective constraints.
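
To make the idea of a seating arrangement under multi-objective constraints concrete, here is a minimal Python sketch. It is not the benchmark's actual scoring method: the round-table layout, the `likes`/`dislikes` pairs, the weights, and the function names are all hypothetical, and the sketch covers only the final arrangement step, not the dialogue or exploration that would reveal the preferences.

```python
from itertools import permutations


def adjacent_pairs(order):
    """Unordered neighbor pairs for guests seated around a round table."""
    n = len(order)
    return {frozenset((order[i], order[(i + 1) % n])) for i in range(n)}


def score(order, likes, dislikes, w_like=1.0, w_dislike=2.0):
    """Hypothetical objective: reward satisfied 'likes', penalize 'dislikes'."""
    pairs = adjacent_pairs(order)
    gained = sum(1 for pair in likes if frozenset(pair) in pairs)
    lost = sum(1 for pair in dislikes if frozenset(pair) in pairs)
    return w_like * gained - w_dislike * lost


def best_arrangement(guests, likes, dislikes):
    """Brute-force search; fixing the first guest skips rotated duplicates."""
    first, rest = guests[0], guests[1:]
    candidates = ((first,) + perm for perm in permutations(rest))
    return max(candidates, key=lambda order: score(order, likes, dislikes))


if __name__ == "__main__":
    guests = ["Ana", "Ben", "Cy", "Dee"]   # hypothetical NPC names
    likes = [("Ana", "Ben"), ("Cy", "Dee")]
    dislikes = [("Ben", "Cy")]
    order = best_arrangement(guests, likes, dislikes)
    print(order, score(order, likes, dislikes))
```

Exhaustive search is only practical for a handful of guests. It is used here because it keeps the trade-off between competing preferences easy to inspect.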