Inspect - Open-source evals library maintained by UK AISI and spearheaded by JJ Allaire. It supports many types of evals, including multiple-choice benchmarks and LM agent settings (a minimal usage sketch follows this list).
Vivaria - METR's open-source evals tool, built for LM agent evaluations and for running tasks written against the METR Task Standard (a sketch of the task interface also follows this list).
Aider - A widely used open-source coding assistant, recommended for speeding up coding tasks.
Other
AideML - A tool frequently used in Kaggle competitions. See also METR's example agents.
See also Jacques Thibodeau's guide "How much I'm paying for AI productivity software".
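To make the Inspect entry above concrete, here is a minimal sketch of a multiple-choice eval. The `Task`/`Sample`/`multiple_choice`/`choice` names follow Inspect's public API as I understand it; the question, task name, and model are placeholders, so check the Inspect docs for current details.

```python
# pip install inspect-ai
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import choice
from inspect_ai.solver import multiple_choice

@task
def tiny_mc_eval():
    """A one-question multiple-choice benchmark (toy example)."""
    return Task(
        dataset=[
            Sample(
                input="Which gas makes up most of Earth's atmosphere?",
                choices=["Oxygen", "Nitrogen", "Carbon dioxide"],
                target="B",  # letter of the correct choice
            )
        ],
        solver=multiple_choice(),  # formats the choices and elicits a letter answer
        scorer=choice(),           # scores the chosen letter against target
    )
```

You would then run it from the command line with something like `inspect eval tiny_mc_eval.py --model openai/gpt-4o`.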
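And here is a rough sketch of the shape of a task family under the METR Task Standard, which Vivaria runs. The class layout and method names are reproduced from memory, and the task variants and scoring logic are invented for illustration; consult METR's task-standard repository for the authoritative interface.

```python
# A task family under the METR Task Standard is a Python class roughly like
# this (shape from memory; see METR's task-standard repo for the real spec).

class TaskFamily:
    standard_version = "0.3.0"  # version of the task standard targeted (placeholder)

    @staticmethod
    def get_tasks() -> dict[str, dict]:
        # One entry per task variant in the family (variants invented here).
        return {"easy": {"n": 10}, "hard": {"n": 1000}}

    @staticmethod
    def get_instructions(t: dict) -> str:
        # The prompt shown to the agent for variant t.
        return f"Write the first {t['n']} prime numbers to /home/agent/primes.txt."

    @staticmethod
    def score(t: dict, submission: str) -> float | None:
        # Return a score in [0, 1], or None if scoring is done manually.
        return None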
Devising ML Metrics (Hendrycks and Woodside, 2024) - Discusses essential principles for designing effective evaluation metrics. See also Wei (2024) for insights on what makes evals successful.
Other
Model Organisms of Misalignment (Hubinger, 2023) - Argues for building small-scale versions of concerning AI threat models for study.
Video: Intro to Model Evaluations by Marius Hobbhahn (Apollo, 2024) - A 40-minute non-technical introduction to model evaluations.
METR's Autonomy Evaluation Resources (METR, 2024) - Collection of resources for LM agent evaluations.
UK AISI’s Early Insights from Developing Question-Answer Evaluations for Frontier AI (UK AISI, 2024) - Practical lessons from building QA evaluations for frontier models.