
MedArena: Comparing LLMs for Medicine in the Wild

Date: April 24, 2025
Topics: Healthcare, Natural Language Processing, Generative AI

Stanford scholars enlist physicians to evaluate 11 large language models in real-world settings.

The use of large language models (LLMs) in the medical domain holds transformative potential, promising advances in areas ranging from clinical decision support and medical education to patient communication. Their growing relevance is underscored by recent reports indicating that up to two-thirds of American physicians now use AI tools in their practice.

Realizing this potential safely and effectively hinges on the development of rigorous and clinically relevant evaluation methodologies. Currently, the predominant approaches for assessing the medical capabilities of LLMs, namely benchmark datasets such as MMLU (Massive Multitask Language Understanding) and MedQA (Medical Question Answering), rely primarily on static, multiple-choice question (MCQ) formats. While valuable for gauging foundational knowledge, these evaluation paradigms have significant limitations that restrict their applicability to real-world clinical contexts.

Firstly, they typically assess a narrow spectrum of medical knowledge, often neglecting other critical and common LLM use cases in healthcare, such as patient communication, clinical documentation generation, or summarizing medical literature. Secondly, their static nature means they fail to reflect the most recent medical knowledge, such as the latest drug approvals or recently updated clinical guidelines. Furthermore, the reliance on MCQ formats oversimplifies the complexities of clinical reasoning and practice. Clinicians are seldom presented with pre-defined options when diagnosing a patient or formulating a treatment plan. Evaluations focused solely on identifying the single 'correct' answer overlook the critical importance of the diagnostic process itself. Particularly in clinical diagnosis, the method of information synthesis and the overall presentation of the reasoning are often as important, if not more so, than the final conclusion. Existing methods fail to capture these nuances, offering an incomplete picture of an LLM's true clinical utility.

This highlights a critical need for evaluation frameworks that move beyond these limitations. Specifically, medical LLM evaluations must become more dynamic, reflecting the most current medical questions and adapting to the iterative nature of clinical questioning, and more holistic, assessing overall response quality, including reasoning, multi-turn conversation, and clinical appropriateness, rather than merely scoring factual accuracy against a fixed answer set. How do we build an evaluation that draws on real-world clinician questions and has clinicians judge the model responses?

To this end, we introduce MedArena.ai, a novel LLM evaluation platform specifically designed for clinical medicine. MedArena provides a free, interactive arena for clinicians to test and compare top-performing LLMs on their medical queries. 

How MedArena works

MedArena is open to clinicians only. To authenticate users, we partnered with Doximity, a networking service for medical professionals. Clinicians can sign in with their Doximity account or, alternatively, provide their National Provider Identifier (NPI) number. For each query, the user is presented with responses from two randomly chosen LLMs and asked to specify which model they prefer (Figure 1). Our platform then aggregates these preferences and presents a leaderboard (Figure 2) that ranks the LLMs against each other. To help clinicians identify which LLMs they personally prefer most, we also provide personal rankings computed from a user's own votes once a minimum number of preferences has been submitted.
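
To make the aggregation step concrete, here is a minimal sketch, in Python, of one standard way pairwise votes can be rolled into per-model ratings: an online Elo update (the leaderboard described below reports both Bradley-Terry and Elo ratings). The model identifiers, the 1000-point starting rating, and the K-factor of 32 are illustrative assumptions, not MedArena's actual parameters.

```python
# Illustrative only: an online Elo-style update for aggregating pairwise
# clinician preferences into per-model ratings. The starting rating, K-factor,
# and model identifiers below are assumptions for this sketch.
from collections import defaultdict

K = 32                                   # assumed update step size
ratings = defaultdict(lambda: 1000.0)    # assumed starting rating for every model

def expected_score(r_a, r_b):
    """Probability that model A is preferred over model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def record_vote(model_a, model_b, outcome):
    """outcome: 1.0 if A was preferred, 0.0 if B was preferred, 0.5 for a tie."""
    e_a = expected_score(ratings[model_a], ratings[model_b])
    ratings[model_a] += K * (outcome - e_a)
    ratings[model_b] += K * ((1.0 - outcome) - (1.0 - e_a))

# A few hypothetical votes, then the current ranking
record_vote("gemini-2.0-flash-thinking", "gpt-4o", 1.0)
record_vote("gpt-4o", "o1", 0.5)
record_vote("gemini-2.5-pro", "o3-mini", 1.0)
print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```

Personal rankings can be produced the same way by restricting the vote stream to a single clinician's preferences.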

Figure 1: An example query on the MedArena platform. Upon submitting a medical query, users are shown anonymized, side-by-side responses from two randomly selected models. Users then select their preferred response (“Model A,” “Model B,” “Tie,” or “Neither”) and may optionally provide a free-text explanation for their choice. MedArena supports both single- and multi-turn queries, including image-based prompts. The platform includes 11 top-performing, commercially available LLMs from major providers.

Figure 2: The MedArena leaderboard as of April 23, 2025. From the collected clinician preferences, we calculate a ranking of all models based on Bradley-Terry and Elo ratings. Win rates, bootstrapped 95% confidence intervals, and pairwise p-values are also calculated to assess the statistical significance of model comparisons.
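
As a rough illustration of how such a leaderboard can be computed, the sketch below fits Bradley-Terry strengths to (winner, loser) pairs by maximum likelihood and bootstraps the votes for approximate 95% intervals. The tiny vote list, the ridge term used for identifiability, and the choice of scipy's optimizer are assumptions of this sketch; it also omits ties and the pairwise p-values reported on the actual leaderboard.

```python
# A minimal Bradley-Terry sketch (not MedArena's code): fit model strengths to
# (winner, loser) preference pairs by maximum likelihood, then bootstrap the
# votes for rough 95% intervals. The vote list is hypothetical.
import numpy as np
from scipy.optimize import minimize

models = ["gemini-2.0-flash-thinking", "gpt-4o", "o1"]
idx = {m: i for i, m in enumerate(models)}
votes = [("gemini-2.0-flash-thinking", "gpt-4o"),
         ("gemini-2.0-flash-thinking", "o1"),
         ("gpt-4o", "o1"),
         ("o1", "gpt-4o")]                     # (winner, loser) pairs

def fit_bt(vote_sample):
    win = np.array([idx[w] for w, _ in vote_sample])
    lose = np.array([idx[l] for _, l in vote_sample])
    def nll(s):
        diff = s[win] - s[lose]
        # negative log-likelihood of the Bradley-Terry model,
        # plus a small ridge term so the scores are identifiable
        return np.sum(np.logaddexp(0.0, -diff)) + 1e-3 * np.sum(s ** 2)
    return minimize(nll, np.zeros(len(models)), method="BFGS").x

scores = fit_bt(votes)

# Resample the votes with replacement to attach rough 95% intervals.
rng = np.random.default_rng(0)
boot = np.array([fit_bt([votes[i] for i in rng.integers(len(votes), size=len(votes))])
                 for _ in range(500)])
lo, hi = np.percentile(boot, [2.5, 97.5], axis=0)
for m in models:
    i = idx[m]
    print(f"{m}: {scores[i]:+.2f}  (95% CI {lo[i]:+.2f} to {hi[i]:+.2f})")
```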

Since our general release in early March, we have collected over 1,200 clinician preferences from over 300 clinicians representing 80+ subspecialties, covering 11 top-performing LLMs from providers including OpenAI (GPT), Google (Gemini), Meta (Llama), and Anthropic (Claude). We find that Google's Gemini models are preferred significantly more often than other models such as GPT-4o and o1.

We categorized each clinician query into one of six categories (Figure 3, top): Medical Knowledge and Evidence, Treatment and Guidelines, Clinical Cases and Diagnosis, Patient Communication and Education, Clinical Documentation and Practical Information, and Miscellaneous. Medical Knowledge and Evidence accounted for only about a third (38%) of the questions asked, while the remaining two-thirds consisted of questions about treatments, clinical cases, documentation, and communication. We also clustered the reasons clinicians gave for their preferences into six categories (Figure 3, bottom): Depth and Detail, Accuracy and Clinical Validity, Presentation and Clarity, Use of References and Up-to-Date Guidelines, and Miscellaneous. The most common reason given (32%) was Depth and Detail.

Figure 3: Examples of question categories (top) and preference reasons (bottom). We observe strong heterogeneity in both the types of questions asked (e.g., medical knowledge, patient communication) and the reasons given for preferences (e.g., depth and detail, use of references).

Finally, because we compute the model rankings using the Bradley-Terry model, we can also control for factors like the style and length of responses. In our analysis, we find that although longer responses tend to be preferred and are positively correlated with higher win rates, response length is not a significant predictor of model preference once these factors are controlled for. Other stylistic factors, however, such as the presence of bold text and lists, are significant confounders of model preference.
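
A common way to implement this kind of control, sketched below under assumed data, is to express the Bradley-Terry model as a logistic regression over pairwise outcomes and append style-difference covariates (response length, counts of bold spans and lists) as extra features, so the fitted model strengths are adjusted for presentation. The model names, style features, and battle records are hypothetical; this is not the actual MedArena analysis.

```python
# Hedged sketch: Bradley-Terry via logistic regression, with style covariates
# (length, bold spans, lists) appended so model strengths are adjusted for
# presentation. All names and data below are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

models = ["model_x", "model_y", "model_z"]        # hypothetical model list
idx = {m: i for i, m in enumerate(models)}
style_keys = ["length", "bold", "lists"]

# (first model, second model, first's style stats, second's style stats,
#  1 if the first model was preferred else 0) -- toy data
battles = [
    ("model_x", "model_y", {"length": 850, "bold": 6, "lists": 3},
     {"length": 400, "bold": 0, "lists": 0}, 1),
    ("model_y", "model_z", {"length": 500, "bold": 2, "lists": 1},
     {"length": 900, "bold": 8, "lists": 4}, 0),
    ("model_z", "model_x", {"length": 700, "bold": 1, "lists": 2},
     {"length": 650, "bold": 5, "lists": 2}, 1),
]

X, y = [], []
for a, b, style_a, style_b, a_preferred in battles:
    row = np.zeros(len(models) + len(style_keys))
    row[idx[a]], row[idx[b]] = 1.0, -1.0                 # Bradley-Terry terms
    for k, key in enumerate(style_keys):                 # style-difference terms
        row[len(models) + k] = style_a[key] - style_b[key]
    X.append(row)
    y.append(a_preferred)

X, y = np.array(X), np.array(y)
# Standardize the style columns so their coefficients are on a scale
# comparable to the +/-1 model indicators.
X[:, len(models):] /= X[:, len(models):].std(axis=0)

clf = LogisticRegression(fit_intercept=False).fit(X, y)
adjusted_strengths = dict(zip(models, clf.coef_[0][:len(models)].round(2)))
style_effects = dict(zip(style_keys, clf.coef_[0][len(models):].round(2)))
print(adjusted_strengths, style_effects)
```

The style coefficients then indicate how much formatting alone shifts the odds of winning a comparison, separate from the adjusted model strengths.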

Our Findings 

As of April 23, 2025, Gemini 2.0 Flash Thinking is the top-ranked model on MedArena, followed by GPT-4o and Gemini 2.5 Pro. Interestingly, we observe that more powerful reasoning models like OpenAI's o1 or o3-mini do not outperform many non-reasoning models like GPT-4o and Perplexity's Llama model.

Our early results also highlight the mismatch between existing benchmark tasks and the actual types of questions clinicians ask. Only about a third of real-world queries fell into the traditional category of medical knowledge and evidence, the focus of most current evaluations (e.g., MedQA, MMLU). The majority of clinician input instead centered on practical, context-rich areas like treatment decision-making, patient communication, and documentation, domains poorly captured by static MCQs. Additionally, about 20% of conversations were multi-turn, which is also not captured in current evaluation benchmarks. This divergence underscores the need for evaluation systems grounded in real clinical workflows.

The diversity of user preferences further emphasizes that model evaluation should go beyond correctness. Clinicians frequently cited qualities like “depth and detail” and “clarity of presentation” as driving their preferences — features that are essential for trust and utility but are not captured by current automated metrics. Interestingly, stylistic elements such as formatting (e.g., bolding, lists) were found to significantly influence model preference, revealing that perceived usability and readability play a non-trivial role in model evaluation. This introduces a key challenge: how to disentangle true model reasoning quality from superficial presentation enhancements in future evaluation frameworks.

These findings also highlight the value of using paired comparisons and ranking models in head-to-head scenarios, rather than relying on static accuracy scores. The Bradley-Terry model enables a more nuanced analysis by accounting for confounding factors such as response length and formatting. 

MedArena provides a scalable, clinician-centric framework for evaluating LLMs in medicine. As these tools increasingly enter clinical workflows, we hope that platforms like MedArena will improve the way that clinical medicine is evaluated in a manner that reflects the nuanced, contextual nature of real-world medical practice.

James Zou is an associate professor of Biomedical Data Science and, by courtesy, of Computer Science and Electrical Engineering at Stanford University; Eric Wu is a PhD candidate in electrical engineering; Kevin Wu is a PhD candidate in biomedical informatics.

Contributors: Eric Wu, Kevin Wu, James Zou

