Recent advancements in Vision Language Models (VLMs) have expanded their capabilities to interactive agent tasks, yet existing benchmarks remain limited to single-agent or text-only environments. In contrast, real-world scenarios often involve multiple agents interacting within rich visual and linguistic contexts, posing challenges that combine multimodal observations with strategic interactions. To bridge this gap, we introduce Visual Strategic Bench (VS-Bench), a multimodal benchmark that evaluates VLMs for strategic reasoning and decision-making in multi-agent environments. VS-Bench comprises eight vision-grounded environments spanning cooperative, competitive, and mixed-motive interactions, designed to assess agents' ability to predict others' future moves and optimize for long-term objectives. We consider two complementary evaluation dimensions: offline evaluation of strategic reasoning by next-action prediction accuracy, and online evaluation of decision-making by normalized episode return. Extensive experiments on fourteen leading VLMs reveal a significant gap between current models and optimal performance, with the best models attaining 47.8% prediction accuracy and 24.3% normalized return. We further conduct in-depth analyses of multimodal observations, test-time scaling, social behaviors, and failure cases of VLM agents. By standardizing the evaluation and highlighting the limitations of existing models, we envision VS-Bench as a foundation for future research on strategic multimodal agents.
Figure: The eight VS-Bench environments: Hanabi (Player 0 and Player 1 views), Overcooked, Breakthrough, Kuhn Poker, Atari Pong, Coin Dilemma, Monster Hunt, and Battle of the Colors.
Strategic reasoning is the theory-of-mind ability to infer the hidden beliefs, desires, and intentions of other agents. This requires agents to think from others' perspectives and answer the question: What would other agents do in the next steps? Strategic reasoning is crucial in multi-agent environments because an agent's reward function depends not only on its own action, but also on others' actions.
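As a concrete reading of this offline evaluation, the sketch below scores next-action prediction accuracy over recorded episodes; `vlm_predict_action` and the episode format are illustrative assumptions, not the benchmark's released interface.

```python
# Minimal sketch of the offline strategic-reasoning metric: next-action
# prediction accuracy. `vlm_predict_action` is a hypothetical callable
# (not a released VS-Bench API) that sends the rendered observation to a
# VLM and parses its prediction of the co-player's next action.

def prediction_accuracy(episodes, vlm_predict_action):
    """Return the percentage of steps where the predicted action of the
    other agent matches the action it actually took."""
    correct, total = 0, 0
    for episode in episodes:
        for obs, actual_action in episode:
            predicted = vlm_predict_action(obs)
            correct += int(predicted == actual_action)
            total += 1
    return 100.0 * correct / max(total, 1)
```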
Models | Overall | Hanabi (Coop.) | Overcooked (Coop.) | Breakthrough (Comp.) | Kuhn Poker (Comp.) | Atari Pong (Comp.) | Coin Dilemma (Mixed) | Monster Hunt (Mixed) | Battle of the Colors (Mixed) |
---|---|---|---|---|---|---|---|---|---|
Oracle | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
o4-mini | 47.8 | 58.3 | 31.8 | 26.8 | 63.5 | 43.5 | 62.8 | 43.8 | 52.5 |
gemini-2.5-flash | 45.5 | 37.0 | 21.0 | 23.3 | 65.0 | 41.3 | 63.5 | 50.3 | 62.8 |
claude-3-7-sonnet | 42.4 | 39.0 | 26.0 | 24.3 | 65.5 | 44.8 | 53.8 | 42.5 | 43.3 |
doubao-1-5-thinking-pro | 36.6 | 32.8 | 26.3 | 19.8 | 57.8 | 44.3 | 26.5 | 45.3 | 40.0 |
qvq-max | 30.5 | 32.3 | 19.0 | 21.8 | 59.3 | 37.8 | 25.3 | 21.5 | 27.0 |
gpt-4.1 | 36.6 | 23.0 | 27.0 | 22.5 | 54.0 | 41.5 | 49.9 | 36.8 | 38.0 |
doubao-1-5-vision-pro | 30.8 | 15.0 | 22.3 | 15.8 | 53.8 | 31.3 | 37.3 | 36.0 | 34.8 |
gemini-2.5 w/o thinking | 29.2 | 21.5 | 19.3 | 14.8 | 48.5 | 34.0 | 31.8 | 30.5 | 33.8 |
qwen-vl-max | 28.5 | 26.5 | 26.0 | 19.5 | 45.3 | 23.5 | 26.3 | 23.5 | 37.3 |
claude-3-7 w/o thinking | 28.3 | 9.8 | 16.0 | 18.0 | 56.0 | 43.3 | 31.0 | 26.0 | 26.8 |
grok-2-vision | 25.7 | 12.8 | 17.3 | 10.8 | 53.3 | 20.8 | 30.3 | 31.5 | 29.0 |
Qwen2.5-VL-72B-Ins. | 30.4 | 26.8 | 26.5 | 23.8 | 45.2 | 27.0 | 30.0 | 27.3 | 36.8 |
InternVL3-78B | 29.7 | 25.3 | 20.5 | 14.0 | 45.5 | 34.8 | 37.0 | 30.0 | 30.3 |
Llama-3.2-90B-Vision-Ins. | 26.4 | 20.0 | 16.5 | 11.8 | 53.3 | 36.3 | 26.3 | 25.0 | 18.8 |
Random | 24.3 | 8.8 | 16.7 | 4.3 | 50.0 | 33.3 | 25.4 | 29.3 | 26.5 |
Table 1: Strategic reasoning evaluation results. For each environment, the first, second, and third best results are highlighted in green, while the results below random are highlighted in red.
Finding 1: Existing VLMs exhibit preliminary strategic reasoning ability by outperforming random guesses, but they are still far from accurate prediction of others' next actions.
Decision-making is the ability to optimize for one's long-term objectives under uncertainty. This requires agents to prioritize future accumulated returns over immediate gains, adapt to non-stationary dynamics with evolving agents, and balance cooperation and competition to navigate toward favorable equilibria.
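The online metric in Table 2 is a normalized episode return. A plausible normalization, inferred from the Random and Optimal rows being pinned at 0.0 and 100.0 (an assumption, not the paper's exact code), rescales the raw return against random and optimal baselines:

```python
def normalized_return(raw_return: float,
                      random_return: float,
                      optimal_return: float) -> float:
    """Rescale an episode return so that a random policy scores 0 and an
    optimal policy scores 100. Negative values mean worse than random.
    This normalization is an assumption inferred from Table 2, where the
    Random and Optimal rows are pinned to 0.0 and 100.0."""
    span = optimal_return - random_return
    if span == 0:
        return 0.0
    return 100.0 * (raw_return - random_return) / span


# Example: an agent scoring 12 in an environment where random play averages 5
# and optimal play achieves 25 gets a normalized return of 35.0.
print(normalized_return(12.0, 5.0, 25.0))  # -> 35.0
```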
Models | Overall | Hanabi (Coop.) | Overcooked (Coop.) | Breakthrough (Comp.) | Kuhn Poker (Comp.) | Atari Pong (Comp.) | Coin Dilemma (Mixed) | Monster Hunt (Mixed) | Battle of the Colors (Mixed) |
---|---|---|---|---|---|---|---|---|---|
Optimal | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
o4-mini | 24.3 | 42.9±30.5 | 17.0±6.8 | 30.0±47.0 | 69.3±32.9 | 11.2±16.5 | -4.6±21.4 | 24.9±8.2 | 3.5±5.4 |
doubao-1-5-thinking-pro | 21.3 | 56.7±22.8 | 10.1±4.7 | 10.0±21.0 | 68.5±24.3 | 2.9±2.5 | 0.7±3.2 | 17.2±11.3 | 4.0±4.8 |
gemini-2.5-flash | 20.1 | 27.1±36.0 | 8.5±5.4 | 20.0±25.75 | 34.7±53.2 | 1.6±1.9 | 10.0±25.5 | 26.2±5.8 | 32.8±8.5 |
claude-3-7-sonnet | 19.5 | 6.7±21.1 | 10.1±3.5 | 20.0±39.75 | 92.9±22.7 | -0.5±1.0 | 4.6±15.4 | 19.9±3.5 | 2.5±4.6 |
qvq-max | -0.1 | 0.0±0.0 | 2.0±3.4 | 5.0±31.5 | -8.7±37.8 | 0.4±1.6 | 0.0±2.1 | 0.7±4.5 | -0.5±0.0 |
gemini-2.5 w/o thinking | 3.0 | 0.0±0.0 | 2.0±4.0 | 0.0±0.0 | 18.2±30.4 | 1.0±1.4 | -0.7±4.3 | 0.7±8.9 | 2.5±3.4 |
gpt-4.1 | 2.8 | 0.0±0.0 | -0.5±0.0 | 0.0±0.0 | -7.1±40.4 | 0.2±1.4 | 17.8±6.7 | 11.2±5.6 | 0.5±2.0 |
qwen-vl-max | -1.0 | 1.2±2.0 | -0.5±0.0 | 0.0±0.0 | -20.5±56.9 | -0.3±1.0 | -0.4±2.8 | 13.2±20.2 | -0.5±0.0 |
grok-2-vision | -1.1 | 0.0±0.0 | 1.5±3.3 | 0.0±0.0 | -11.8±54.2 | -0.1±1.5 | 1.1±7.0 | -0.4±5.8 | 0.5±2.0 |
claude-3-7 w/o thinking | -1.9 | 0.0±0.0 | 2.0±4.0 | 5.0±31.5 | -23.6±50.9 | -0.9±0.3 | 1.4±9.2 | 0.2±8.2 | 1.0±2.3 |
doubao-1-5-vision-pro | -4.5 | 0.0±0.0 | -0.5±0.0 | 0.0±0.0 | -40.2±57.4 | -0.9±0.3 | -2.1±5.2 | 7.8±8.2 | -0.5±0.0 |
Qwen2.5-VL-72B-Ins. | 1.9 | 0.8±1.8 | -0.5±0.0 | 0.0±0.0 | -3.2±49.3 | -0.8±0.2 | 0.0±2.7 | 19.6±25.7 | -0.5±0.0 |
InternVL3-78B | 1.0 | 0.0±0.0 | 0.0±1.5 | 0.0±0.0 | 4.0±62.1 | -0.9±0.3 | 6.8±8.9 | -1.8±9.2 | 0.0±1.5 |
Llama-3.2-90B-Vision-Ins. | -4.2 | 0.0±0.0 | 1.5±3.3 | 0.0±0.0 | -39.4±59.7 | -0.9±0.3 | 0.4±3.4 | 3.6±4.9 | 1.0±2.3 |
Random | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Table 2: Decision-making evaluation results. For each environment, the first, second, and third best results are highlighted in green, while the results below or equal to random are highlighted in red.
Finding 2: Existing VLMs struggle with decision-making in multi-agent environments, leaving a significant performance gap that remains an open challenge for future research.
In principle, multimodal observations that combine images and text provide richer information and should therefore yield better results. However, the reasoning and decision-making evaluations show that environments with inherently visual states, such as video games, are especially challenging for VLM agents, suggesting that they struggle to exploit multimodal observations. We investigate this by constructing text-only observations in which images are replaced with corresponding text descriptions. We select a board game, a card game, and a video game, and evaluate reasoning VLMs on decision-making with both multimodal and text-only observations.
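The sketch below illustrates how the two observation formats could be assembled for an OpenAI-style chat API; `describe_state_in_text` and `render_image_base64` are hypothetical stubs standing in for the environment's text serializer and image renderer, not part of VS-Bench itself.

```python
import base64

def describe_state_in_text(state) -> str:
    # Hypothetical stub: a real implementation would serialize the board,
    # hands, and scores into the same information the image conveys.
    return str(state)

def render_image_base64(state) -> str:
    # Hypothetical stub: a real implementation would render the game frame
    # to PNG bytes; here we only encode a placeholder.
    return base64.b64encode(b"<png bytes>").decode()

def build_messages(state, instruction: str, text_only: bool) -> list:
    """Build a chat message with either a text-only or a multimodal observation."""
    if text_only:
        # Text-only ablation: the image is replaced by a text description.
        content = [{"type": "text",
                    "text": describe_state_in_text(state) + "\n" + instruction}]
    else:
        # Multimodal observation: image of the current state plus instruction.
        content = [
            {"type": "image_url",
             "image_url": {"url": "data:image/png;base64," + render_image_base64(state)}},
            {"type": "text", "text": instruction},
        ]
    return [{"role": "user", "content": content}]
```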
Finding 3: Existing VLMs often fail to extract useful information from visual inputs, so multimodal observations do not reliably improve their strategic reasoning and decision-making over text-only observations.
We observe in the evaluation results that reasoning models generally achieve better performance than chat models. To further investigate test-time scaling of VLMs in multi-agent environments, we apply Chain-of-Thought (CoT) prompting to chat models and compare their performance against both reasoning models and chat models with simple IO prompting.
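For illustration, simple IO prompting asks the model to answer directly, while CoT prompting asks it to reason about other agents and long-term returns before answering; the templates below are representative examples, not the exact VS-Bench prompts.

```python
# Illustrative prompt templates for the test-time scaling comparison.
# The exact wording used by VS-Bench is not reproduced here.

IO_PROMPT = (
    "{observation}\n"
    "Choose one legal action and reply with the action only."
)

COT_PROMPT = (
    "{observation}\n"
    "Think step by step: consider what the other agents are likely to do "
    "next and how your choice affects your long-term return. "
    "Then output your final action on the last line as 'Action: <action>'."
)

def format_prompt(observation_text: str, use_cot: bool) -> str:
    """Select the CoT or IO template and fill in the observation."""
    template = COT_PROMPT if use_cot else IO_PROMPT
    return template.format(observation=observation_text)
```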
Finding 4: Test-time scaling like reasoning and Chain-of-Thought (CoT) prompting can substantially improve VLMs' performance in multi-agent environments.
Another interesting observation from the evaluation results is that open-source models can achieve results comparable to reasoning models in some mixed-motive games. We investigate this by visualizing the behaviors of two leading reasoning models and the best-performing open-source models in each social dilemma game.
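One simple way to quantify such behaviors (a sketch of the idea, not the paper's exact analysis) is the fraction of prosocial actions an agent takes per episode; `is_prosocial` is a hypothetical per-environment predicate, e.g. in Coin Dilemma, collecting only one's own coins rather than the co-player's.

```python
def prosocial_rate(episode_actions, is_prosocial) -> float:
    """Fraction of an agent's actions labeled prosocial by `is_prosocial`,
    a hypothetical per-environment predicate (e.g., in Coin Dilemma,
    collecting only one's own coins rather than the co-player's)."""
    if not episode_actions:
        return 0.0
    return sum(is_prosocial(a) for a in episode_actions) / len(episode_actions)
```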
Finding 5: Open-source VLMs can achieve comparable results to commercial reasoning VLMs in some social dilemma games with prosocial behaviors for mutual benefit.
To understand why VLMs underperform in multi-agent environments, we conduct a qualitative analysis of their failure cases. In strategic reasoning, two common failure modes are ignoring the interaction history and mishandling private information. For example, in Hanabi, each player's cards are visible to the other players but not to the player holding them. VLMs often overlook this information asymmetry and predict others' next actions using information that those agents cannot actually see. In decision-making, a common failure case is focusing excessively on one's own actions while ignoring those of others. For example, in Breakthrough, VLMs tend to persistently advance their own pieces and fail to identify defensive vulnerabilities, which directly results in losing the match.
@article{xu2025vs,
  title={VS-Bench: Evaluating VLMs for Strategic Reasoning and Decision-Making in Multi-Agent Environments},
  author={Xu, Zelai and Xu, Zhexuan and Yi, Xiangmin and Yuan, Huining and Chen, Xinlei and Wu, Yi and Yu, Chao and Wang, Yu},
  journal={arXiv preprint arXiv:2506.02387},
  year={2025}
}