Inspect - Open-source evals library maintained by UK AISI and spearheaded by JJ Allaire. It supports many types of evals, including multiple-choice benchmarks and LM agent settings (a minimal usage sketch follows this list).
Vivaria - METR's open-source evals tool, built for LM agent evaluations and for running tasks written against the METR Task Standard (a sketch of the task interface also follows this list).
Aider - A widely used open-source coding assistant, recommended for speeding up coding tasks.
Other
AideML - A tool frequently used in Kaggle competitions. See also METR's example agents.
See also Jacques Thibodeau's guide "How much I'm paying for AI productivity software".
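To make the Inspect entry above concrete, here is a minimal sketch of a multiple-choice eval. The `Task`/`Sample`/`multiple_choice`/`choice` names follow Inspect's public API as I understand it; the question, task name, and model are placeholders, so check the Inspect docs for current details.

```python
# pip install inspect-ai
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import choice
from inspect_ai.solver import multiple_choice

@task
def tiny_mc_eval():
    """A one-question multiple-choice benchmark (toy example)."""
    return Task(
        dataset=[
            Sample(
                input="Which gas makes up most of Earth's atmosphere?",
                choices=["Oxygen", "Nitrogen", "Carbon dioxide"],
                target="B",  # letter of the correct choice
            )
        ],
        solver=multiple_choice(),  # formats the choices and elicits a letter answer
        scorer=choice(),           # scores the chosen letter against target
    )
```

You would then run it from the command line with something like `inspect eval tiny_mc_eval.py --model openai/gpt-4o`.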
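And here is a rough sketch of the shape of a task family under the METR Task Standard, which Vivaria runs. The class layout and method names are reproduced from memory, and the task variants and scoring logic are invented for illustration; consult METR's task-standard repository for the authoritative interface.

```python
# A task family under the METR Task Standard is a Python class roughly like
# this (shape from memory; see METR's task-standard repo for the real spec).

class TaskFamily:
    standard_version = "0.3.0"  # version of the task standard targeted (placeholder)

    @staticmethod
    def get_tasks() -> dict[str, dict]:
        # One entry per task variant in the family (variants invented here).
        return {"easy": {"n": 10}, "hard": {"n": 1000}}

    @staticmethod
    def get_instructions(t: dict) -> str:
        # The prompt shown to the agent for variant t.
        return f"Write the first {t['n']} prime numbers to /home/agent/primes.txt."

    @staticmethod
    def score(t: dict, submission: str) -> float | None:
        # Return a score in [0, 1], or None if scoring is done manually.
        return None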
Devising ML Metrics (Hendrycks and Woodside, 2024) - Discusses essential principles for designing effective evaluation metrics. See also Wei (2024) for insights on what makes evals successful.
Other
Model Organisms of Misalignment (Hubinger, 2023) - Argues for building small-scale versions of concerning AI threat models for study.
Video: Intro to Model Evaluations by Marius Hobbhahn (Apollo, 2024) - A 40-minute non-technical introduction to model evaluations.
METR's Autonomy Evaluation Resources (METR, 2024) - Collection of resources for LM agent evaluations.
UK AISI’s Early Insights from Developing Question-Answer Evaluations for Frontier AI (UK AISI, 2024) - Practical lessons from building QA evaluations for frontier models.