MedArena: Comparing LLMs for Medicine in the Wild

Stanford researchers enlist physicians to evaluate 11 large language models in real-world settings.
The use of large language models (LLMs) in the medical domain holds transformative potential, promising advancements in areas ranging from clinical decision support and medical education to patient communication. This increasing relevance is highlighted by recent reports indicating that up to two-thirds of American physicians now utilize AI tools in their practice.
Realizing this potential safely and effectively hinges on developing rigorous, clinically relevant evaluation methodologies. Currently, the predominant approaches for assessing the medical capabilities of LLMs, namely benchmark datasets such as MMLU (Massive Multitask Language Understanding) and MedQA (Medical Question Answering), rely primarily on static, multiple-choice question (MCQ) formats. While valuable for gauging foundational knowledge, these evaluation paradigms suffer from significant limitations that restrict their applicability to real-world clinical contexts.
First, they typically assess a narrow spectrum of medical knowledge, often neglecting other critical and common LLM use cases in healthcare, such as patient communication, clinical documentation generation, or summarizing medical literature. Second, their static nature means they fail to reflect the most recent medical knowledge, such as the latest drug approvals or recently updated clinical guidelines. Finally, the reliance on MCQ formats oversimplifies the complexities of clinical reasoning and practice. Clinicians are seldom presented with pre-defined options when diagnosing a patient or formulating a treatment plan. Evaluations focused solely on identifying the single 'correct' answer overlook the diagnostic process itself: particularly in clinical diagnosis, how information is synthesized and how the reasoning is presented are often as important as the final conclusion, if not more so. Existing methods fail to capture these nuances, offering an incomplete picture of an LLM's true clinical utility.
This highlights a critical need for evaluation frameworks that move beyond these current limitations. Specifically, medical LLM evaluations must become more dynamic, capable of reflecting the most current medical questions and adapting to the iterative nature of clinical questioning, and more holistic, assessing the entire response quality, including reasoning, multi-turn conversations, and clinical appropriateness, rather than merely scoring factual accuracy on a fixed answer set. How do we build an evaluation that draws on real-world clinician questions and has clinicians themselves judge the model responses?
To this end, we introduce MedArena.ai, a novel LLM evaluation platform specifically designed for clinical medicine. MedArena provides a free, interactive arena for clinicians to test and compare top-performing LLMs on their medical queries.
How MedArena works
MedArena is open to clinicians only. To authenticate users, we partnered with Doximity, a networking service for medical professionals. Clinicians can sign in with their Doximity account or, alternatively, provide their National Provider Identifier (NPI) number. For an input query, the user is presented with responses from two randomly chosen LLMs and asked to specify which model they prefer (Figure 1). Our platform then aggregates the preferences and presents a leaderboard (Figure 2), ranking different LLMs against each other. To help clinicians understand their own most-preferred LLMs, we also provide personal rankings based on a user's individual data, given a minimum number of preferences.
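For readers curious how a stream of pairwise preferences can be turned into a leaderboard, here is a minimal, illustrative Elo-style aggregation in Python. The K-factor, starting rating, and preference records are assumptions for illustration, not details of MedArena's implementation.

```python
# A minimal, illustrative Elo aggregation (assumed defaults; not MedArena's code).
from collections import defaultdict

K = 32  # assumed update step size (a common Elo default)

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that the first model wins under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(ratings: dict, model_a: str, model_b: str, outcome: float) -> None:
    """outcome: 1.0 if model_a was preferred, 0.0 if model_b, 0.5 for a tie."""
    e_a = expected_score(ratings[model_a], ratings[model_b])
    ratings[model_a] += K * (outcome - e_a)
    ratings[model_b] += K * ((1.0 - outcome) - (1.0 - e_a))

# Invented preference records: (model shown as A, model shown as B, outcome).
preferences = [
    ("gemini-2.0-flash-thinking", "gpt-4o", 1.0),
    ("gpt-4o", "o1", 0.5),
    ("gemini-2.5-pro", "o3-mini", 1.0),
]

ratings = defaultdict(lambda: 1000.0)  # every model starts at the same rating
for model_a, model_b, outcome in preferences:
    update_elo(ratings, model_a, model_b, outcome)

for model, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model:28s} {rating:7.1f}")
```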

Figure 1: An example query on the MedArena platform. Upon submitting a medical query, users are shown anonymized, side-by-side responses from two randomly selected models. Users then select their preferred response (“Model A,” “Model B,” “Tie,” or “Neither”) and may optionally provide a free-text explanation for their choice. MedArena supports both single- and multi-turn queries, including image-based prompts. The platform includes 11 top-performing, commercially available LLMs from major providers.

Figure 2: The MedArena leaderboard from April 23, 2025. From the collected clinician preferences, we compute a ranking of all models using Bradley-Terry and Elo ratings. Win rates, bootstrapped 95% confidence intervals, and pairwise p-values are also calculated to assess the statistical significance of model comparisons.
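The Bradley-Terry scores behind a leaderboard like Figure 2 can be obtained by fitting a logistic regression over model indicator variables, with confidence intervals from resampling the battles. The sketch below, using scikit-learn and invented battle counts, illustrates this general recipe rather than MedArena's actual pipeline.

```python
# A sketch of Bradley-Terry ranking with bootstrapped 95% CIs (illustrative data).
import numpy as np
from sklearn.linear_model import LogisticRegression

models = ["gemini-2.0-flash-thinking", "gpt-4o", "gemini-2.5-pro", "o1"]
idx = {m: i for i, m in enumerate(models)}

# Invented battle counts: (model_a, model_b) -> (a_wins, b_wins); ties are dropped here.
counts = {
    ("gemini-2.0-flash-thinking", "gpt-4o"): (70, 30),
    ("gpt-4o", "o1"): (60, 40),
    ("gemini-2.5-pro", "o1"): (80, 20),
    ("gemini-2.0-flash-thinking", "gemini-2.5-pro"): (55, 45),
}
battles = [(a, b, 1) for (a, b), (w, l) in counts.items() for _ in range(w)] + \
          [(a, b, 0) for (a, b), (w, l) in counts.items() for _ in range(l)]

def fit_bradley_terry(battles):
    """P(A beats B) = sigmoid(theta_A - theta_B), fit as a logistic regression."""
    X = np.zeros((len(battles), len(models)))
    y = np.empty(len(battles))
    for r, (a, b, a_wins) in enumerate(battles):
        X[r, idx[a]], X[r, idx[b]] = 1.0, -1.0
        y[r] = a_wins
    # Light L2 regularization keeps scores finite even with lopsided records.
    return LogisticRegression(fit_intercept=False, C=10.0).fit(X, y).coef_[0]

theta = fit_bradley_terry(battles)

# Bootstrap: resample the battles with replacement and refit.
rng = np.random.default_rng(0)
boot = np.array([
    fit_bradley_terry([battles[i] for i in rng.integers(len(battles), size=len(battles))])
    for _ in range(200)
])
lo, hi = np.percentile(boot, [2.5, 97.5], axis=0)

for m in sorted(models, key=lambda m: -theta[idx[m]]):
    i = idx[m]
    print(f"{m:28s} score={theta[i]:+.2f}  95% CI [{lo[i]:+.2f}, {hi[i]:+.2f}]")
```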
Since our general release in early March, we have collected over 1,200 clinician preferences from over 300 clinicians representing 80+ subspecialties, across 11 top-performing LLMs from providers including OpenAI (GPT), Google (Gemini), Meta (Llama), and Anthropic (Claude). We find that Google Gemini models are preferred significantly more often than other models such as GPT-4o and o1.
We categorized the clinician queries into one of six categories (Figure 3, top): Medical Knowledge and Evidence, Treatment and Guidelines, Clinical Cases and Diagnosis, Patient Communication and Education, Clinical Documentation and Practical Information, and Miscellaneous. Medical Knowledge and Evidence accounted for only about a third (38%) of the questions asked; the remaining two-thirds concerned treatments, clinical cases, documentation, and communication. We similarly clustered the free-text reasons clinicians gave for their preferences into six categories (Figure 3, bottom), including Depth and Detail, Accuracy and Clinical Validity, Presentation and Clarity, Use of References and Up-to-date Guidelines, and Miscellaneous. The most common reason (32%) was Depth and Detail.


Figure 3: Examples of question categories (top) and preference reasons (bottom). We observe substantial heterogeneity in both the types of questions asked (e.g., medical knowledge, patient communication) and the reasons given for preferences (e.g., depth and detail, use of references).
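As one concrete, purely illustrative way to group free-text preference reasons into a handful of themes, the sketch below uses TF-IDF features and k-means from scikit-learn; MedArena's own categorization may differ, and the example reasons are invented.

```python
# Illustrative clustering of free-text preference reasons (not MedArena's method).
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

reasons = [
    "Model A gave a much more detailed differential",
    "Response B cited current guidelines",
    "A was better formatted and easier to read",
    "B included an inaccurate drug dose",
    "A explained the reasoning step by step",
    "B linked to references supporting its answer",
]

# Represent each reason as a TF-IDF vector, then group into a few clusters.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(reasons)

kmeans = KMeans(n_clusters=3, random_state=0, n_init=10).fit(X)
for label, reason in zip(kmeans.labels_, reasons):
    print(label, reason)
```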
Finally, computing the model rankings with the Bradley-Terry model also lets us control for factors like the style and length of responses. In our analysis, we find that although clinicians tend to prefer longer responses, with length positively correlated with win rate, length itself is not a significant predictor of model preference once other factors are accounted for. Other stylistic factors, such as the presence of bold text and lists, are significant confounders of model preference.
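Style control of this kind can be implemented by adding stylistic covariates, such as the difference in response length and the presence of bold text or lists, to the Bradley-Terry logistic regression, so that the model coefficients reflect preference after adjusting for presentation. The sketch below is a hedged illustration with invented battles and simple hand-rolled features, not MedArena's analysis code.

```python
# A sketch of a style-controlled Bradley-Terry fit (illustrative data and features).
import numpy as np
from sklearn.linear_model import LogisticRegression

models = ["gemini-2.0-flash-thinking", "gpt-4o", "o1"]
idx = {m: i for i, m in enumerate(models)}

def style_features(text: str) -> np.ndarray:
    """Simple stylistic descriptors of a response: length, bold text, bullet lists."""
    return np.array([
        len(text.split()),                         # response length in words
        float("**" in text),                       # any markdown bold
        float("\n- " in text or "\n* " in text),   # any bullet list
    ])

def design_row(model_a, model_b, resp_a, resp_b):
    """Model indicators (+1 / -1) plus the difference in style features."""
    row = np.zeros(len(models) + 3)
    row[idx[model_a]], row[idx[model_b]] = 1.0, -1.0
    row[len(models):] = style_features(resp_a) - style_features(resp_b)
    return row

# Invented battles: (model_a, model_b, response_a, response_b, a_preferred).
battles = [
    ("gpt-4o", "o1", "**Short answer** with a\n- list", "A long plain paragraph " * 5, 1),
    ("gemini-2.0-flash-thinking", "gpt-4o", "Detailed reasoning " * 8, "Brief reply", 1),
    ("o1", "gemini-2.0-flash-thinking", "Plain text answer", "**Bolded** summary\n- item", 0),
] * 20

X = np.array([design_row(a, b, ra, rb) for a, b, ra, rb, _ in battles])
y = np.array([pref for *_, pref in battles])

clf = LogisticRegression(fit_intercept=False, C=10.0, max_iter=1000).fit(X, y)
model_scores = clf.coef_[0][: len(models)]
style_coefs = dict(zip(["length_diff", "bold_diff", "list_diff"], clf.coef_[0][len(models):]))
print({m: round(model_scores[i], 2) for m, i in idx.items()})  # style-adjusted scores
print({k: round(v, 3) for k, v in style_coefs.items()})        # stylistic confounders
```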
Our Findings
As of April 23, 2025, Gemini 2.0 Flash Thinking is the top-ranked model on MedArena, followed by GPT-4o and Gemini 2.5 Pro. Interestingly, we observe that more powerful reasoning models like OpenAI's o1 or o3-mini do not outperform many non-reasoning models like GPT-4o and Perplexity's Llama model. Our early results also highlight the mismatch between existing benchmark tasks and the actual types of questions clinicians ask. Only about a third of real-world queries fell into the traditional category of medical knowledge and evidence, the focus of most current evaluations (e.g., MedQA, MMLU). The majority of clinician input instead centered on practical, context-rich areas like treatment decision-making, patient communication, and documentation — domains poorly captured by static MCQs. Additionally, about 20% of conversations were multi-turn, which is also not captured in current evaluation benchmarks. This divergence underscores the need for evaluation systems grounded in real clinical workflows.
The diversity of user preferences further emphasizes that model evaluation should go beyond correctness. Clinicians frequently cited qualities like “depth and detail” and “clarity of presentation” as driving their preferences — features that are essential for trust and utility but are not captured by current automated metrics. Interestingly, stylistic elements such as formatting (e.g., bolding, lists) were found to significantly influence model preference, revealing that perceived usability and readability play a non-trivial role in model evaluation. This introduces a key challenge: how to disentangle true model reasoning quality from superficial presentation enhancements in future evaluation frameworks.
These findings also highlight the value of using paired comparisons and ranking models in head-to-head scenarios, rather than relying on static accuracy scores. The Bradley-Terry model enables a more nuanced analysis by accounting for confounding factors such as response length and formatting.
MedArena provides a scalable, clinician-centric framework for evaluating LLMs in medicine. As these tools increasingly enter clinical workflows, we hope that platforms like MedArena will improve how LLMs for clinical medicine are evaluated, in a manner that reflects the nuanced, contextual nature of real-world medical practice.
James Zou is an associate professor of Biomedical Data Science and, by courtesy, of Computer Science and Electrical Engineering at Stanford University; Eric Wu is a PhD candidate in electrical engineering; Kevin Wu is a PhD candidate in biomedical informatics.