Escalation Risks from LLMs in Military and Diplomatic Contexts

Date

May 02, 2024

Topics

International Affairs, International Security, International Development

abstract

In this brief, scholars explain how they designed a wargame simulation to evaluate the escalation risks of large language models (LLMs) in high-stakes military and diplomatic decision-making.

Key Takeaways

Many nations are increasingly considering integrating autonomous AI agents in high-stakes military and diplomatic decision-making.
We designed a novel wargame simulation and scoring framework to evaluate the escalation risks of actions taken by AI agents based on five off-the-shelf large language models (LLMs). We found that all models show forms of escalation and difficult-to-predict escalation patterns that lead to greater conflict and, in some cases, the use of nuclear weapons.
The model with the most escalatory and unpredictable decisions was the only tested LLM that did not undergo reinforcement learning with human feedback—a safety technique to align models to human instructions. This underscores the importance of alignment techniques and fine-tuning.
Policymakers should be cautious to proceed when confronted with proposals to use LLMs in military and foreign policy decision-making. Turning high-stakes decisions over to autonomous LLM-based agents can lead to significant escalatory action.

Executive Summary

Following the widespread adoption of ChatGPT and other large language models (LLMs), policymakers and scholars are increasingly discussing how LLM-based agents—AI models that can reason about uncertainty and decide what actions are optimal—could be integrated into high-stakes military and diplomatic decision-making. In 2023, the U.S. military reportedly began evaluating five LLMs in a simulated conflict scenario to test military planning capacity. Palantir, Scale AI, and other companies are already building LLM-based decision-making systems for the U.S. military. Meanwhile, there has also been an uptick in conversations around employing LLM-based agents to augment foreign policy decision-making.

Some argue that, compared to humans, LLMs deployed in military and diplomatic decision-making contexts could process more information, make decisions significantly faster, allocate resources more efficiently, and better facilitate communication between key personnel. At the same time, however, concerns about the risks of over-relying on autonomous agents have increased. While AI-based models may make fewer emotionally driven decisions, compared to human decision-making, these could lead to more unpredictable and escalatory behavior. Last year, a bipartisan bill proposed to block the use of federal funds for AI that launches or selects targets for nuclear weapons without meaningful human control while the White House’s Executive Order on AI requires government oversight of AI applications in national defense.

In our paper, “Escalation Risks from Language Models in Military and Diplomatic Decision-Making,” we designed a wargame simulation and scoring framework to evaluate how LLM-based agents behave in conflict scenarios without human oversight. We focused on five off-the-shelf LLMs, assessing how actions chosen by these agents in different scenarios could contribute to escalation risks. Our paper is the first of its kind to draw on political science and international relations literature on escalation dynamics to generate qualitative and quantitative insights into LLMs in these settings. Our findings show that LLMs exhibit difficult-to-predict, escalatory behavior, which underscores the importance of understanding when, how, and why LLMs may fail in these high-stakes contexts.

Introduction

Analysts have long used wargames to simulate conflict scenarios. Previous research with computer-assisted wargames—ranging from decision-support systems to comprehensive simulations—has examined how computer systems perform in these high-consequence settings. One 2021 study found that wargames with heavy computer automation have been more likely to lead to nuclear use. However, there have been only limited wargame simulations that focus specifically on the behavior of LLM-based agents. One notable study explored the use of a combination of LLMs and reinforcement learning models in the game Diplomacy but did not examine LLMs by themselves. A new partnership between an AI startup and a think tank will explore using LLMs in wargames, but it is unclear if results will be made publicly available.

Our research adds to this body of work by quantitatively and qualitatively evaluating the use of off-the-shelf LLMs in wargame scenarios. In particular, we focus on the risk of escalation, which renowned military strategist Herman Kahn described as a situation where there is competition in risk-taking and resolve and where fear that the other side will overreact serves as a deterrent. We evaluate how LLM-based agents behave in simulated conflict scenarios and whether, and how, their decisions could contribute to an escalation of the conflict.

For each simulation, we set up eight “nation agents” based on one of five LLMs: OpenAI’s GPT-3.5, GPT-4, and GBT-4-Base; Anthropic’s Claude 2; and Meta’s Llama-2 (70B) Chat. We provided each nation agent model with background information on its nation and told each model that it is a decision-maker in that country’s military and foreign policy interacting with other AI-controlled agents. At each turn, the agents chose up to three actions from a predetermined list of 27 options, which included peaceful actions (such as negotiating trade agreements), neutral actions (such as sending private messages), and escalatory actions (such as executing cyberattacks or launching nuclear weapons). The agents also generated up to 250 words describing their reasoning before choosing their decisions.

We told the agents their actions would have real-world consequences. A separate world model LLM summarized the consequences on the agents and the simulated world, which started out with one of three initial scenarios: a neutral scenario without initial events; an invasion scenario, where one nation invaded another before the simulation began; or a cyberattack scenario, where one LLM-based agent launched a cyberattack on another before the simulation’s start. The agents’ actions and their consequences were revealed simultaneously after each day and fed into prompts given during subsequent days.

Read Paper

Related Publications

Policy Implications of DeepSeek AI’s Talent Base

Amy Zegart, Emerson Johnston

Quick ReadMay 06, 2025

Policy Brief

This brief presents an analysis of Chinese AI startup DeepSeek’s talent base and calls for U.S. policymakers to reinvest in competing to attract and retain global AI talent.

Policy Brief

Policy Implications of DeepSeek AI’s Talent Base

Amy Zegart, Emerson Johnston

International Affairs, International Security, International DevelopmentFoundation ModelsWorkforce, LaborQuick ReadMay 06

This brief presents an analysis of Chinese AI startup DeepSeek’s talent base and calls for U.S. policymakers to reinvest in competing to attract and retain global AI talent.

Mind the (Language) Gap: Mapping the Challenges of LLM Development in Low-Resource Language Contexts

Juan Pava, Haifa Badi Uz Zaman, Caroline Meinhardt, Toni Friedman, Sang T. Truong, Daniel Zhang, Elena Cryst, Vukosi Marivate, Sanmi Koyejo

Deep DiveApr 22, 2025

White Paper

This white paper maps the LLM development landscape for low-resource languages, highlighting challenges, trade-offs, and strategies to increase investment; prioritize cross-disciplinary, community-driven development; and ensure fair data ownership.

White Paper

Mind the (Language) Gap: Mapping the Challenges of LLM Development in Low-Resource Language Contexts

Juan Pava, Haifa Badi Uz Zaman, Caroline Meinhardt, Toni Friedman, Sang T. Truong, Daniel Zhang, Elena Cryst, Vukosi Marivate, Sanmi Koyejo

International Affairs, International Security, International DevelopmentNatural Language ProcessingDeep DiveApr 22

Response to USAID's Request for Information on AI in Global Development Playbook

Caroline Meinhardt, Toni Friedman, Haifa Badi Uz Zaman, Daniel Zhang, Rodrigo Balbontín, Juan Pava, Vyoma Raman, Kevin Klyman, Marietje Schaake, Jef Caers, Francis Fukuyama

Mar 01, 2024

Response to Request

In this response to the U.S. Agency for International Development’s (USAID) request for information on the development of an AI in Global Development Playbook, scholars from Stanford HAI and The Asia Foundation call for an approach to AI in global development that is grounded in local perspectives and tailored to the specific circumstances of Global Majority countries.

Response to Request

Response to USAID's Request for Information on AI in Global Development Playbook

Caroline Meinhardt, Toni Friedman, Haifa Badi Uz Zaman, Daniel Zhang, Rodrigo Balbontín, Juan Pava, Vyoma Raman, Kevin Klyman, Marietje Schaake, Jef Caers, Francis Fukuyama

International Affairs, International Security, International DevelopmentMar 01

Fei Fei Li's Testimony Before the Senate Committee on Homeland Security and Governmental Affairs

Fei-Fei Li

Sep 14, 2023

Testimony

We have arrived at an inflection point in the world of AI, largely propelled by breakthroughs in generative AI, including increasingly sophisticated language models like GPT-4. These models have revolutionized various sectors from customer service to adaptive learning. However, the scope of intelligence is far broader than linguistic capability alone. In my specialized field of computer vision, we have also witnessed remarkable advancements that empower machines to analyze and act upon visual information—essentially teaching computers to 'see.'

Testimony

Fei Fei Li's Testimony Before the Senate Committee on Homeland Security and Governmental Affairs

Fei-Fei Li

Government, Public AdministrationInternational Affairs, International Security, International DevelopmentSep 14

Stay Up To Date

Navigate

Participate

Escalation Risks from LLMs in Military and Diplomatic Contexts

Key Takeaways

Executive Summary

Introduction

Juan-Pablo Rivera

Gabriel Mukobi

Anka Reuel

Max Lamparth

Chandler Smith

Jacquelyn Schneider

Related Publications

Policy Implications of DeepSeek AI’s Talent Base

Policy Implications of DeepSeek AI’s Talent Base

Mind the (Language) Gap: Mapping the Challenges of LLM Development in Low-Resource Language Contexts

Mind the (Language) Gap: Mapping the Challenges of LLM Development in Low-Resource Language Contexts

Response to USAID's Request for Information on AI in Global Development Playbook

Response to USAID's Request for Information on AI in Global Development Playbook

Fei Fei Li's Testimony Before the Senate Committee on Homeland Security and Governmental Affairs

Fei Fei Li's Testimony Before the Senate Committee on Homeland Security and Governmental Affairs