VS-Bench: Evaluating VLMs for Strategic Reasoning and Decision-Making in Multi-Agent Environments

1Tsinghua University, 2University of Science and Technology of China,
3Shanghai Qi Zhi Institute, 4Beijing Zhongguancun Academy
*Equal contribution

Overview

Figure 1: Overview of VS-Bench, a multimodal benchmark for evaluating VLMs in multi-agent environments.

Abstract

Recent advancements in Vision Language Models (VLMs) have expanded their capabilities to interactive agent tasks, yet existing benchmarks remain limited to single-agent or text-only environments. In contrast, real-world scenarios often involve multiple agents interacting within rich visual and linguistic contexts, posing challenges with both multimodal observations and strategic interactions. To bridge this gap, we introduce Visual Strategic Bench (VS-Bench), a multimodal benchmark that evaluates VLMs for strategic reasoning and decision-making in multi-agent environments. VS-Bench comprises eight vision-grounded environments spanning cooperative, competitive, and mixed-motive interactions, designed to assess agents' ability to predict others' future moves and optimize for long-term objectives. We consider two complementary evaluation dimensions: offline evaluation of strategic reasoning by next-action prediction accuracy, and online evaluation of decision-making by normalized episode return. Extensive experiments on fourteen leading VLMs reveal a significant gap between current models and optimal performance, with the best model attaining only 47.8% prediction accuracy and 24.3% normalized return. We further conduct in-depth analyses of multimodal observations, test-time scaling, social behaviors, and failure cases of VLM agents. By standardizing the evaluation and highlighting the limitations of existing models, we envision VS-Bench as a foundation for future research on strategic multimodal agents.

Figure 2: Evaluation results on strategic reasoning.
Figure 3: Evaluation results on decision-making.

Environments

Cooperative Games

Hanabi (Player 0)

Hanabi (Player 1)

Overcooked

Competitive Games

Breakthrough

Kuhn Poker

Atari Pong

Mixed-Motive Games

Coin Dilemma

Monster Hunt

Battle of the Colors

Benchmark Results

Strategic Reasoning

Strategic reasoning is the theory-of-mind ability to infer the hidden beliefs, desires, and intentions of other agents. It requires an agent to think from others' perspectives and answer the question: what will the other agents do next? This ability is crucial in multi-agent environments because an agent's reward depends not only on its own actions but also on the actions of others.
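Concretely, the offline protocol scores a model by its next-action prediction accuracy over pre-collected trajectories. The minimal Python sketch below illustrates this metric; the predict_next_action wrapper, the baseline class, and the dataset format are hypothetical stand-ins for illustration, not the benchmark's actual interface.

def prediction_accuracy(model, dataset):
    # Offline strategic reasoning score: percentage of steps at which
    # the model correctly predicts another agent's next action.
    correct = 0
    for observation, true_action in dataset:
        predicted = model.predict_next_action(observation)  # hypothetical VLM wrapper
        correct += int(predicted == true_action)
    return 100.0 * correct / len(dataset)

class ConstantBaseline:
    # Trivial stand-in "model" that always predicts the same action.
    def predict_next_action(self, observation):
        return "noop"

dataset = [({"step": 0}, "noop"), ({"step": 1}, "move_up")]
print(prediction_accuracy(ConstantBaseline(), dataset))  # 50.0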

Models  Overall  Hanabi  Overcooked  Board  Poker  Pong  Dilemma  Hunt  Battle
(Cooperative: Hanabi, Overcooked; Competitive: Board, Poker, Pong; Mixed-Motive: Dilemma, Hunt, Battle)
Oracle 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0
o4-mini 47.8 58.3 31.8 26.8 63.5 43.5 62.8 43.8 52.5
gemini-2.5-flash 45.5 37.0 21.0 23.3 65.0 41.3 63.5 50.3 62.8
claude-3-7-sonnet 42.4 39.0 26.0 24.3 65.5 44.8 53.8 42.5 43.3
doubao-1-5-thinking-pro 36.6 32.8 26.3 19.8 57.8 44.3 26.5 45.3 40.0
qvq-max 30.5 32.3 19.0 21.8 59.3 37.8 25.3 21.5 27.0
gpt-4.1 36.6 23.0 27.0 22.5 54.0 41.5 49.9 36.8 38.0
doubao-1-5-vision-pro 30.8 15.0 22.3 15.8 53.8 31.3 37.3 36.0 34.8
gemini-2.5 w/o thinking 29.2 21.5 19.3 14.8 48.5 34.0 31.8 30.5 33.8
qwen-vl-max 28.5 26.5 26.0 19.5 45.3 23.5 26.3 23.5 37.3
claude-3-7 w/o thinking 28.3 9.8 16.0 18.0 56.0 43.3 31.0 26.0 26.8
grok-2-vision 25.7 12.8 17.3 10.8 53.3 20.8 30.3 31.5 29.0
Qwen2.5-VL-72B-Ins. 30.4 26.8 26.5 23.8 45.2 27.0 30.0 27.3 36.8
InternVL3-78B 29.7 25.3 20.5 14.0 45.5 34.8 37.0 30.0 30.3
Llama-3.2-90B-Vision-Ins. 26.4 20.0 16.5 11.8 53.3 36.3 26.3 25.0 18.8
Random 24.3 8.8 16.7 4.3 50.0 33.3 25.4 29.3 26.5

Table 1: Strategic reasoning evaluation results. For each environment, the first, second, and third best results are highlighted in green, while the results below random are highlighted in red.

Finding 1: Existing VLMs exhibit preliminary strategic reasoning ability by outperforming random guessing, but they are still far from accurately predicting others' next actions.

Decision-Making

Decision-making is the ability to optimize for one's long-term objectives under uncertainty. This requires agents to prioritize future accumulated returns over immediate gains, adapt to non-stationary dynamics with evolving agents, and balance cooperation and competition to navigate toward favorable equilibria.
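The normalized episode return reported below appears to anchor a random policy at 0 and the optimal policy at 100 (consistent with the Random and Optimal rows of Table 2), so negative scores indicate worse-than-random play. A minimal sketch of one natural normalization consistent with the table, with the baseline returns taken as given:

def normalized_return(r_agent, r_random, r_optimal):
    # Map a raw episode return onto a scale where random play scores 0
    # and optimal play scores 100; values below 0 mean worse than random.
    return 100.0 * (r_agent - r_random) / (r_optimal - r_random)

# e.g., a raw return of 3.0 when random averages 1.0 and optimal is 9.0:
print(normalized_return(3.0, 1.0, 9.0))  # 25.0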

Models  Overall  Hanabi  Overcooked  Board  Poker  Pong  Dilemma  Hunt  Battle
(Cooperative: Hanabi, Overcooked; Competitive: Board, Poker, Pong; Mixed-Motive: Dilemma, Hunt, Battle)
Optimal 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0
o4-mini 24.3 42.9±30.5 17.0±6.8 30.0±47.0 69.3±32.9 11.2±16.5 -4.6±21.4 24.9±8.2 3.5±5.4
doubao-1-5-thinking-pro 21.3 56.7±22.8 10.1±4.7 10.0±21.0 68.5±24.3 2.9±2.5 0.7±3.2 17.2±11.3 4.0±4.8
gemini-2.5-flash 20.1 27.1±36.0 8.5±5.4 20.0±25.75 34.7±53.2 1.6±1.9 10.0±25.5 26.2±5.8 32.8±8.5
claude-3-7-sonnet 19.5 6.7±21.1 10.1±3.5 20.0±39.75 92.9±22.7 -0.5±1.0 4.6±15.4 19.9±3.5 2.5±4.6
qvq-max -0.1 0.0±0.0 2.0±3.4 5.0±31.5 -8.7±37.8 0.4±1.6 0.0±2.1 0.7±4.5 -0.5±0.0
gemini-2.5 w/o thinking 3.0 0.0±0.0 2.0±4.0 0.0±0.0 18.2±30.4 1.0±1.4 -0.7±4.3 0.7±8.9 2.5±3.4
gpt-4.1 2.8 0.0±0.0 -0.5±0.0 0.0±0.0 -7.1±40.4 0.2±1.4 17.8±6.7 11.2±5.6 0.5±2.0
qwen-vl-max -1.0 1.2±2.0 -0.5±0.0 0.0±0.0 -20.5±56.9 -0.3±1.0 -0.4±2.8 13.2±20.2 -0.5±0.0
grok-2-vision -1.1 0.0±0.0 1.5±3.3 0.0±0.0 -11.8±54.2 -0.1±1.5 1.1±7.0 -0.4±5.8 0.5±2.0
claude-3-7 w/o thinking -1.9 0.0±0.0 2.0±4.0 5.0±31.5 -23.6±50.9 -0.9±0.3 1.4±9.2 0.2±8.2 1.0±2.3
doubao-1-5-vision-pro -4.5 0.0±0.0 -0.5±0.0 0.0±0.0 -40.2±57.4 -0.9±0.3 -2.1±5.2 7.8±8.2 -0.5±0.0
Qwen2.5-VL-72B-Ins. 1.9 0.8±1.8 -0.5±0.0 0.0±0.0 -3.2±49.3 -0.8±0.2 0.0±2.7 19.6±25.7 -0.5±0.0
InternVL3-78B 1.0 0.0±0.0 0.0±1.5 0.0±0.0 4.0±62.1 -0.9±0.3 6.8±8.9 -1.8±9.2 0.0±1.5
Llama-3.2-90B-Vision-Ins. -4.2 0.0±0.0 1.5±3.3 0.0±0.0 -39.4±59.7 -0.9±0.3 0.4±3.4 3.6±4.9 1.0±2.3
Random 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

Table 2: Decision-making evaluation results. For each environment, the first, second, and third best results are highlighted in green, while the results below or equal to random are highlighted in red.

Finding 2: Existing VLMs struggle with decision-making in multi-agent environments, leaving a significant performance gap that remains an open challenge for future research.

Analysis

Multimodal Observations

In principle, multimodal observations combining images and text provide richer information and should therefore yield better results. However, the reasoning and decision-making evaluations show that environments with inherently visual states, such as video games, are especially challenging for VLM agents, suggesting that current models may fail to fully exploit visual information. We investigate this by constructing text-only observations that replace each image with a corresponding text description. We select a board game, a card game, and a video game, and evaluate reasoning VLMs on decision-making with multimodal and text-only observations.
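As a concrete illustration of how such an ablation can be set up, the sketch below builds the two observation variants from the same underlying state; the dict-based VLM input format and the arguments are assumptions for illustration, not the benchmark's actual interface.

def build_observation(prompt, frame=None, text_state=None, mode="multimodal"):
    # Assemble the VLM input for the ablation: "multimodal" attaches the
    # rendered frame, while "text" substitutes a textual description of
    # the same state, keeping the instruction prompt identical.
    if mode == "multimodal":
        return {"text": prompt, "image": frame}
    if mode == "text":
        return {"text": prompt + "\nState description: " + text_state}
    raise ValueError(f"unknown mode: {mode}")

# Usage: the same prompt under the two observation modes.
obs_mm = build_observation("Choose your next move.", frame=b"<png bytes>",
                           mode="multimodal")
obs_txt = build_observation("Choose your next move.",
                            text_state="Pawn at e2; opponent pawn at e7.",
                            mode="text")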

Figure 4: Comparison of reasoning VLMs on decision-making with multimodal and text-only observations. The solid and dashed vertical lines represent the average results of the two settings.

Finding 3: Existing VLMs can fail to extract visual information, and multimodal observations do not consistently improve their strategic reasoning and decision-making performance over text-only observations.

Test-Time Scaling

The evaluation results show that reasoning models generally achieve better performance than chat models. To further investigate test-time scaling of VLMs in multi-agent environments, we apply Chain-of-Thought (CoT) prompting to chat models and compare their performance with that of reasoning models and of chat models using simple input-output (IO) prompting.
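To make the two prompting regimes concrete, the sketch below contrasts a direct IO prompt with a CoT prompt that elicits intermediate reasoning before the action; the exact wording is an assumption for illustration, not the benchmark's actual prompts.

def make_prompt(obs_text, style):
    # IO prompting asks for the action directly; CoT prompting asks the
    # model to reason step by step before committing to an action.
    base = obs_text + "\nLegal actions are listed above."
    if style == "io":
        return base + "\nAnswer with a single action."
    if style == "cot":
        return base + ("\nFirst, think step by step about what the other "
                       "agents are likely to do. Then output your action "
                       "on the final line.")
    raise ValueError(f"unknown style: {style}")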

Figure 5: Comparison of reasoning VLMs and chat VLMs on decision-making with IO and CoT prompting. The solid, dashed, and dotted vertical lines represent the average results of the three settings.

Finding 4: Test-time scaling techniques, such as built-in reasoning and Chain-of-Thought (CoT) prompting, can substantially improve VLMs' performance in multi-agent environments.

Social Behaviors

Another interesting observation in the evaluation results is that open-source models can achieve results comparable to reasoning models in some mixed-motive games. We investigate this by visualizing the behaviors of two leading reasoning models and the best-performing open-source model in each social dilemma game.
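One simple way to quantify such behaviors is a per-model cooperation rate over logged actions, as in the sketch below; the prosocial action labels are hypothetical stand-ins for the environment-specific actions (e.g., collecting only one's own coins in Coin Dilemma).

def cooperation_rate(actions, prosocial=("cooperate", "share")):
    # Fraction of an agent's logged actions classified as prosocial;
    # the label set is a hypothetical placeholder, not the real action space.
    return sum(a in prosocial for a in actions) / len(actions)

print(cooperation_rate(["cooperate", "defect", "cooperate", "share"]))  # 0.75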

Figure 6: Behaviors of two reasoning models and the best-performing open-source models in mixed-motive social dilemma games.

Finding 5: Open-source VLMs can match commercial reasoning VLMs in some social dilemma games by exhibiting prosocial behaviors that lead to mutual benefit.

Failure Examples

To understand why VLMs underperform in multi-agent environments, we conduct a qualitative analysis of their failure cases. In strategic reasoning, two common failure modes are ignoring the interaction history and mishandling private information. For example, in Hanabi, each player's cards are observable to the other players but not to that player themselves. VLMs often overlook this information asymmetry and, when predicting another player's next action, rely on information that player cannot see, such as that player's own cards. In decision-making, another common failure mode is focusing excessively on one's own actions while ignoring those of others. For example, in Breakthrough, VLMs tend to persistently advance their own pieces while failing to identify defensive vulnerabilities, which directly results in losing the match.
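This Hanabi failure mode is easy to state in code: when predicting another player's action, the observation must be built from that player's point of view, hiding their own hand even though the predictor can see it. A minimal sketch with a simplified two-player hand structure (an assumed data layout, not the real Hanabi state):

def observation_for(player, hands):
    # Hanabi-style view: a player sees every hand except their own.
    # `hands` is a simplified {player_id: [cards]} dict for illustration.
    return {p: cards if p != player else ["hidden"] * len(cards)
            for p, cards in hands.items()}

# Predicting player 1's next action must use player 1's view, in which
# player 1's own cards are hidden, even though the predictor can see them:
hands = {0: ["R1", "B3"], 1: ["G2", "Y5"]}
print(observation_for(1, hands))  # {0: ['R1', 'B3'], 1: ['hidden', 'hidden']}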

BibTeX

@article{xu2025vs,
  title={VS-Bench: Evaluating VLMs for Strategic Reasoning and Decision-Making in Multi-Agent Environments},
  author={Xu, Zelai and Xu, Zhexuan and Yi, Xiangmin and Yuan, Huining and Chen, Xinlei and Wu, Yi and Yu, Chao and Wang, Yu},
  journal={arXiv preprint arXiv:2506.02387},
  year={2025}
}