Issue Brief

Improving Transparency in AI Language Models: A Holistic Evaluation

Date: February 28, 2023
Topics: Machine Learning, Foundation Models
Abstract

In this brief, Stanford scholars introduce Holistic Evaluation of Language Models (HELM), a framework for evaluating language models across the many use cases and metrics relevant to their commercial and real-world applications.

Foundation Model Issue Brief Series

In collaboration with the Stanford Center for Research on Foundation Models (CRFM), we present a series of issue briefs on key policy issues related to foundation models—machine learning models trained on massive datasets to power an unprecedented array of applications. Drawing on the latest research and expert insights from the field, this series aims to provide policymakers and the public with a clear and nuanced understanding of these complex technologies, and to help them make informed decisions about how best to regulate and govern their development and use.

Key Takeaways

  • Transparency into AI systems is necessary for policymakers, researchers, and the public to understand these systems and their impacts. We introduce Holistic Evaluation of Language Models (HELM), a framework for benchmarking language models, as a concrete path to providing this transparency.

  • Traditional methods for evaluating language models focus on model accuracy in specific scenarios. Since language models are already used for many different purposes (e.g., summarizing documents, answering questions, retrieving information), HELM covers a broader range of use cases and evaluates each against the many relevant metrics (e.g., fairness, efficiency, robustness, toxicity); a minimal sketch of this scenario-by-metric structure follows this list.

  • In the absence of a clear standard, language model evaluation has been uneven: Different model developers evaluate on different benchmarks, meaning their models cannot be easily compared. We establish a clear head-to-head comparison by evaluating 34 prominent language models from 12 different providers (e.g., OpenAI, Google, Microsoft, Meta).

  • HELM serves as public reporting on AI models—especially for those that are closed-access or widely deployed—empowering decision-makers to understand their function and impact and to ensure their design aligns with human-centered values.
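
The multi-metric evaluation described in the takeaways above is, at its core, a grid: every model is run on every use-case scenario, and each run is scored on several metrics at once rather than on accuracy alone. The short Python sketch below illustrates that structure only; it is not the HELM codebase, and every name in it (Instance, exact_match, response_length, evaluate) is an illustrative assumption.

    # Minimal sketch of a scenario-by-metric evaluation grid.
    # Assumed, illustrative names throughout; this is not the HELM implementation.
    from dataclasses import dataclass
    from typing import Callable, Dict, List, Tuple


    @dataclass
    class Instance:
        prompt: str     # input shown to the model
        reference: str  # expected output used by the metrics


    def exact_match(prediction: str, instance: Instance) -> float:
        # Accuracy-style metric: 1.0 if the prediction matches the reference.
        return float(prediction.strip() == instance.reference.strip())


    def response_length(prediction: str, instance: Instance) -> float:
        # Stand-in for a non-accuracy metric (e.g., verbosity as a crude efficiency proxy).
        return float(len(prediction.split()))


    def evaluate(models: Dict[str, Callable[[str], str]],
                 scenarios: Dict[str, List[Instance]],
                 metrics: Dict[str, Callable[[str, Instance], float]],
                 ) -> Dict[Tuple[str, str, str], float]:
        # Fill the (model, scenario, metric) grid with mean scores, so that
        # different models can be compared head-to-head on the same axes.
        results: Dict[Tuple[str, str, str], float] = {}
        for model_name, model in models.items():
            for scenario_name, instances in scenarios.items():
                predictions = [model(inst.prompt) for inst in instances]
                for metric_name, metric in metrics.items():
                    scores = [metric(p, i) for p, i in zip(predictions, instances)]
                    results[(model_name, scenario_name, metric_name)] = sum(scores) / len(scores)
        return results


    if __name__ == "__main__":
        toy_models = {"echo": lambda prompt: prompt}  # placeholder "model"
        toy_scenarios = {"qa": [Instance(prompt="2+2?", reference="4")]}
        toy_metrics = {"exact_match": exact_match, "length": response_length}
        print(evaluate(toy_models, toy_scenarios, toy_metrics))

Because each cell of that grid is a single number for a fixed (model, scenario, metric) triple, results from different providers land on the same axes, which is what makes the head-to-head comparison of the 34 models described above reportable in one table.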

Introduction

Artificial intelligence (AI) language models are everywhere. People can talk to their smartphone through voice assistants like Siri and Cortana. Consumers can play music, turn up thermostats, and check the weather through smart speakers, like Google Nest or Amazon Echo, that likewise use language models to process commands. Online translation tools help people traveling the world or learning a new language. Algorithms flag “offensive” and “obscene” comments on social media platforms. The list goes on.

The rise of language models, like the text generation tool ChatGPT, is just the tip of the iceberg in the larger paradigm shift toward foundation models—machine learning models, including language models, trained on massive datasets to power an unprecedented array of applications. Their meteoric rise is only surpassed by their sweeping impact: They are reconstituting established industries like web search, transforming practices in classroom education, and capturing widespread media attention. Consequently, characterizing these models is a pressing social matter: If an AI-powered content moderation tool that flags toxic online comments cannot distinguish between offensive and satirical uses of the same word, it could censor speech from marginalized communities.

But how to evaluate foundation models is an open question. The public lacks adequate transparency into these models, from the code underpinning the model to the training and testing data used to bring it into the world. Evaluation presents a way forward by concretely measuring the capabilities, risks, and limitations of foundation models.

In our paper from a 50-person team at the Stanford Center for Research on Foundation Models (CRFM), we propose a framework, Holistic Evaluation of Language Models (HELM), to address the lack of transparency for language models. HELM implements these comprehensive assessments—yielding results that researchers, policymakers, the broader public, and other stakeholders can use.

Authors
  • Rishi Bommasani
  • Daniel Zhang
  • Tony Lee
  • Percy Liang

Related Publications

Policy Implications of DeepSeek AI’s Talent Base
Amy Zegart, Emerson Johnston
Policy Brief
May 06, 2025

This brief presents an analysis of Chinese AI startup DeepSeek’s talent base and calls for U.S. policymakers to reinvest in competing to attract and retain global AI talent.

What Makes a Good AI Benchmark?
Anka Reuel, Amelia Hardy, Chandler Smith, Max Lamparth, Malcolm Hardy, Mykel Kochenderfer
Policy Brief
Dec 11, 2024

This brief presents a novel assessment framework for evaluating the quality of AI benchmarks and scores 24 benchmarks against the framework.

Response to U.S. AI Safety Institute’s Request for Comment on Managing Misuse Risk For Dual-Use Foundation Models
Rishi Bommasani, Alexander Wan, Yifan Mai, Percy Liang, Daniel E. Ho
Response to Request
Sep 09, 2024

In this response to the U.S. AI Safety Institute’s (US AISI) request for comment on its draft guidelines for managing the misuse risk for dual-use foundation models, scholars from Stanford HAI, the Center for Research on Foundation Models (CRFM), and the Regulation, Evaluation, and Governance Lab (RegLab) urge the US AISI to strengthen its guidance on reproducible evaluations and third-party evaluations, as well as clarify guidance on post-deployment monitoring. They also encourage the institute to develop similar guidance for other actors in the foundation model supply chain and for non-misuse risks, while ensuring the continued open release of foundation models absent evidence of marginal risk.

How Persuasive is AI-Generated Propaganda?
Josh A. Goldstein, Jason Chao, Shelby Grossman, Alex Stamos, Michael Tomz
Policy Brief
Sep 03, 2024

This brief presents the findings of an experiment that measures how persuasive AI-generated propaganda is compared to foreign propaganda articles written by humans.
