# LLM Evaluation Knowledge Base

> A deep-research knowledge base built on [alopatenko/LLMEvaluation](https://github.com/alopatenko/LLMEvaluation)
> Generated: 2026-05-06

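## About This Knowledge Base

This knowledge base systematically organizes **191 notes** and **6 thematic syntheses** on LLM evaluation, spanning the full range from methodological evolution to engineering practice. It is aimed at technical managers and architects, and supports them at several levels, from getting an overview to informing decisions.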

---

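## Guide to the Thematic Syntheses

| # | Synthesis | Key insight |
|---|------|---------|
| 1 | [Evolution of LLM Evaluation Methodology](insights/01-evaluation-methodology-evolution.md) | From static benchmarks toward a science of evaluation; construct validity and reproducibility become the new foundations |
| 2 | [Reliability and Limits of LLM-as-Judge](insights/02-llm-as-judge-reliability.md) | Four systematic biases (position, verbosity, self-preference, style); jury systems are currently the best improvement strategy |
| 3 | [The Evaluation Tooling Landscape](insights/03-evaluation-tools-landscape.md) | A three-layer architecture (harness → framework → platform), with a selection decision tree by team size and scenario |
| 4 | [The Science and Pitfalls of Benchmark Design](insights/04-benchmark-design-pitfalls.md) | Data contamination is the biggest threat; dynamic generation and time windows are the main countermeasures |
| 5 | [The Gap Between Leaderboards and Product Evaluation](insights/05-leaderboard-to-product-gap.md) | A one-dimensional ranking cannot be mapped to product value; task-specific, user-aligned evaluation is the bridge |
| 6 | [Meta-Evaluation: Methods for Evaluating Evaluation Itself](insights/06-meta-evaluation.md) | Error bars, PPI, and ranking-stability analysis are the three pillars of a trustworthy evaluation system |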
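
The jury idea noted for LLM-as-Judge in the table above can be illustrated with a simple majority vote over several judges. This is only a minimal sketch: the function name `jury_verdict` and the string labels are hypothetical, the verdicts are assumed to have been collected already, and real jury systems may weight or calibrate judges differently.

```python
from collections import Counter


def jury_verdict(votes):
    """Return the majority label among judge votes, or "tie" if no label wins outright.

    `votes` is a list of labels (for example "A" or "B"), one per judge.
    Both the name and the label format are illustrative assumptions.
    """
    if not votes:
        raise ValueError("need at least one vote")
    ranked = Counter(votes).most_common()
    # No unique winner when the top two labels have the same count.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "tie"
    return ranked[0][0]


if __name__ == "__main__":
    print(jury_verdict(["A", "A", "B"]))
    print(jury_verdict(["A", "B"]))
```

Returning an explicit `"tie"` rather than picking arbitrarily keeps split decisions visible, so they can be routed to a human reviewer.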

---

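## Cross-Cutting Key Insights

1. **Evaluation as a product**: Evaluation is not a one-off test but an ongoing product capability; it needs version management, regression detection, and contamination-resistance mechanisms
2. **No universal benchmark**: Different scenarios (safety, factuality, creativity, code) need different evaluation strategies; a one-dimensional ranking is misleading
3. **LLM-as-Judge is a double-edged sword**: It greatly lowers cost but introduces new biases, so it must be paired with calibration and human-in-the-loop review
4. **Contamination is a structural problem**: Static benchmarks will inevitably be contaminated; dynamic generation (YourBench, LiveCodeBench) is the long-term direction
5. **Meta-evaluation is badly neglected**: Most teams only run evaluations and never evaluate the evaluation itself; error bars and sensitivity analysis should become standard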
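
As a concrete illustration of the error-bar point in item 5, the sketch below attaches a 95% confidence interval to a benchmark accuracy. It is a minimal example under stated assumptions: per-question scores are treated as independent, the interval uses a normal approximation, and the function name `accuracy_with_ci` is hypothetical rather than taken from any of the referenced notes.

```python
import math


def accuracy_with_ci(outcomes, z=1.96):
    """Return (mean, lower, upper) for per-question scores.

    `outcomes` is a list of numeric scores, e.g. 1 for correct and 0 for wrong.
    The interval is mean ± z * standard error (normal approximation),
    assuming questions are independent draws.
    """
    n = len(outcomes)
    if n < 2:
        raise ValueError("need at least two outcomes to estimate variance")
    mean = sum(outcomes) / n
    # Unbiased sample variance, then standard error of the mean.
    variance = sum((x - mean) ** 2 for x in outcomes) / (n - 1)
    sem = math.sqrt(variance / n)
    return mean, mean - z * sem, mean + z * sem


if __name__ == "__main__":
    scores = [1, 1, 1, 0, 1, 0, 1, 1]
    mean, lower, upper = accuracy_with_ci(scores)
    print(f"accuracy = {mean:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

With small evaluation sets the interval is wide, which is the point: two models whose intervals overlap heavily should not be ranked on the point estimate alone.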

---

## Note Index

### Reviews & Surveys (10 notes)

| Note | Topic |
|------|------|
| [order-in-evaluation-court](notes/reviews-surveys/order-in-evaluation-court.md) | Critical analysis of NLG evaluation trends |
| [benchmark-squared](notes/reviews-surveys/benchmark-squared.md) | Systematic evaluation of LLM benchmarks |
| [evaluation-science-generative-ai](notes/reviews-surveys/evaluation-science-generative-ai.md) | Toward an evaluation science for generative AI |
| [benchmark-large-vision-language-models](notes/reviews-surveys/benchmark-large-vision-language-models.md) | Evaluation of large vision-language models |
| [ai-benchmarks-datasets-llm-evaluation](notes/reviews-surveys/ai-benchmarks-datasets-llm-evaluation.md) | AI benchmarks and datasets |
| [llms-as-judges-survey](notes/reviews-surveys/llms-as-judges-survey.md) | Comprehensive survey of LLM-as-Judge |
| [systematic-survey-critical-review-evaluating-llms](notes/reviews-surveys/systematic-survey-critical-review-evaluating-llms.md) | Systematic survey and critical review |
| [survey-evaluation-multimodal-llm](notes/reviews-surveys/survey-evaluation-multimodal-llm.md) | Multimodal LLM evaluation |
| [survey-useful-llm-evaluation](notes/reviews-surveys/survey-useful-llm-evaluation.md) | Useful LLM evaluation methods |
| [evaluating-llms-comprehensive-survey](notes/reviews-surveys/evaluating-llms-comprehensive-survey.md) | Comprehensive survey of LLM evaluation |

### Leaderboards (16 notes)

| Note | Topic |
|------|------|
| [open-llm-leaderboard](notes/leaderboards/open-llm-leaderboard.md) | HuggingFace Open LLM Leaderboard |
| [matharena](notes/leaderboards/matharena.md) | MathArena math evaluation |
| [vidore-v2](notes/leaderboards/vidore-v2.md) | Visual document retrieval, V2 |
| [facts-grounding-leaderboard](notes/leaderboards/facts-grounding-leaderboard.md) | DeepMind factuality leaderboard |
| [lmsys-arena](notes/leaderboards/lmsys-arena.md) | LMSys Chatbot Arena |
| [gaia-benchmark](notes/leaderboards/gaia-benchmark.md) | General AI assistant benchmark |
| [arena-hard](notes/leaderboards/arena-hard.md) | ArenaHard paper |
| [arena-hard-auto](notes/leaderboards/arena-hard-auto.md) | ArenaHard automation tool |
| [zeroeval](notes/leaderboards/zeroeval.md) | AllenAI ZeroEval |
| [length-controlled-alpacaeval](notes/leaderboards/length-controlled-alpacaeval.md) | Length-controlled AlpacaEval |
| [alpaca-eval](notes/leaderboards/alpaca-eval.md) | AlpacaEval code |
| [berkeley-function-calling-leaderboard](notes/leaderboards/berkeley-function-calling-leaderboard.md) | Function-calling leaderboard |
| [enterprise-scenarios-patronus](notes/leaderboards/enterprise-scenarios-patronus.md) | Enterprise scenario evaluation |
| [vectara-hallucination-leaderboard](notes/leaderboards/vectara-hallucination-leaderboard.md) | Hallucination leaderboard |
| [llmperf-leaderboard](notes/leaderboards/llmperf-leaderboard.md) | LLM performance leaderboard |
| [comparing-llm-performance-anyscale](notes/leaderboards/comparing-llm-performance-anyscale.md) | Anyscale performance comparison |

### Eval Software (34 notes)

| Note | Topic |
|------|------|
| [lm-evaluation-harness](notes/eval-software/lm-evaluation-harness.md) | EleutherAI evaluation framework |
| [eureka-ml-insights](notes/eval-software/eureka-ml-insights.md) | Microsoft Eureka |
| [eureka-framework](notes/eval-software/eureka-framework.md) | Eureka standardized framework paper |
| [openai-evals](notes/eval-software/openai-evals.md) | OpenAI Evals |
| [llm-comparator](notes/eval-software/llm-comparator.md) | Google PAIR LLM Comparator |
| [openevals](notes/eval-software/openevals.md) | LangChain OpenEvals |
| [yourbench](notes/eval-software/yourbench.md) | HuggingFace YourBench |
| [score-nvidia](notes/eval-software/score-nvidia.md) | Nvidia SCORE |
| [autogenbench](notes/eval-software/autogenbench.md) | Microsoft AutoGenBench |
| [magentic-one](notes/eval-software/magentic-one.md) | Magentic-One multi-agent |
| [copilot-arena](notes/eval-software/copilot-arena.md) | Copilot Arena |
| [phoenix-arize](notes/eval-software/phoenix-arize.md) | Arize AI Phoenix |
| [openicl](notes/eval-software/openicl.md) | OpenICL framework |
| [deepeval](notes/eval-software/deepeval.md) | Confident-AI DeepEval |
| [mosaicml-composer](notes/eval-software/mosaicml-composer.md) | MosaicML Composer |
| [microsoft-prompty](notes/eval-software/microsoft-prompty.md) | Microsoft Prompty |
| [nvidia-garak](notes/eval-software/nvidia-garak.md) | Nvidia Garak red-teaming tool |
| [mozilla-lm-buddy](notes/eval-software/mozilla-lm-buddy.md) | Mozilla lm-buddy |
| [trulens](notes/eval-software/trulens.md) | TruLens |
| [bigcode-evaluation-harness](notes/eval-software/bigcode-evaluation-harness.md) | BigCode evaluation |
| [llmebench](notes/eval-software/llmebench.md) | LLMeBench |
| [lm-pub-quiz](notes/eval-software/lm-pub-quiz.md) | LM-PUB-QUIZ |

### Eval Articles (25 notes)

| Note | Topic |
|------|------|
| [demystifying-evals-agents](notes/eval-articles/demystifying-evals-agents.md) | Anthropic: demystifying agent evals |
| [product-evals-three-steps](notes/eval-articles/product-evals-three-steps.md) | Eugene Yan: product evals in three steps |
| [political-even-handedness](notes/eval-articles/political-even-handedness.md) | Measuring political bias in Claude |
| [llm-evaluation-4-approaches](notes/eval-articles/llm-evaluation-4-approaches.md) | Four approaches to evaluation |
| [mastering-llm-techniques-evaluation](notes/eval-articles/mastering-llm-techniques-evaluation.md) | Nvidia evaluation techniques |
| [on-gpt-45](notes/eval-articles/on-gpt-45.md) | GPT-4.5 evaluation analysis |
| [meta-llama3-eval-details](notes/eval-articles/meta-llama3-eval-details.md) | Meta Llama 3 evaluation |
| [micro-metrics-llm-evaluation](notes/eval-articles/micro-metrics-llm-evaluation.md) | Micro-metrics framework |
| [huggingface-evaluation-guidebook](notes/eval-articles/huggingface-evaluation-guidebook.md) | HF evaluation guidebook |
| [introducing-simpleqa](notes/eval-articles/introducing-simpleqa.md) | OpenAI SimpleQA |
| [llm-decontaminator](notes/eval-articles/llm-decontaminator.md) | LMSys decontamination |
| [ai-leaderboards-no-longer-useful](notes/eval-articles/ai-leaderboards-no-longer-useful.md) | The case that leaderboards are no longer useful |
| [your-ai-product-needs-eval](notes/eval-articles/your-ai-product-needs-eval.md) | Hamel: your product needs evals |
| [about-evals-andrew-ng](notes/eval-articles/about-evals-andrew-ng.md) | Andrew Ng on evals |
| [frontier-safety-framework](notes/eval-articles/frontier-safety-framework.md) | DeepMind safety framework |

### Frontier Benchmarks (23 notes)

| Note | Topic |
|------|------|
| [mmlu](notes/frontier-benchmarks/mmlu.md) | Original MMLU |
| [mmlu-pro](notes/frontier-benchmarks/mmlu-pro.md) | MMLU-Pro |
| [mmlu-pro-plus](notes/frontier-benchmarks/mmlu-pro-plus.md) | MMLU-Pro+ |
| [gpqa](notes/frontier-benchmarks/gpqa.md) | Graduate-level Q&A |
| [big-bench](notes/frontier-benchmarks/big-bench.md) | BIG-Bench |
| [arc-agi-2](notes/frontier-benchmarks/arc-agi-2.md) | ARC-AGI-2 reasoning challenge |
| [humanitys-last-exam](notes/frontier-benchmarks/humanitys-last-exam.md) | Humanity's Last Exam |
| [swe-bench-verified](notes/frontier-benchmarks/swe-bench-verified.md) | SWE-Bench Verified |
| [livecodebench-pro](notes/frontier-benchmarks/livecodebench-pro.md) | LiveCodeBench Pro |
| [facts-grounding](notes/frontier-benchmarks/facts-grounding.md) | FACTS factuality |
| [tau2-bench](notes/frontier-benchmarks/tau2-bench.md) | Conversational agent evaluation |
| [vending-bench](notes/frontier-benchmarks/vending-bench.md) | Long-horizon agent coherence |
| [michelangelo-long-context](notes/frontier-benchmarks/michelangelo-long-context.md) | Long-context evaluation |
| [charxiv](notes/frontier-benchmarks/charxiv.md) | Chart understanding |
| [screenspot-pro](notes/frontier-benchmarks/screenspot-pro.md) | GUI operation evaluation |
| [video-mmmu](notes/frontier-benchmarks/video-mmmu.md) | Video understanding |
| [omnidocbench](notes/frontier-benchmarks/omnidocbench.md) | PDF document parsing |
| [mmmu-pro](notes/frontier-benchmarks/mmmu-pro.md) | Multimodal understanding, enhanced version |
| [simpleqa-verified](notes/frontier-benchmarks/simpleqa-verified.md) | Factuality verification |
| [global-piqa](notes/frontier-benchmarks/global-piqa.md) | Physical commonsense reasoning |
| [mcp-atlas](notes/frontier-benchmarks/mcp-atlas.md) | MCP tool use |

### Meta-Evaluation (36 notes)

| Note | Topic |
|------|------|
| [leaderboard-illusion](notes/meta-evaluation/leaderboard-illusion.md) | The leaderboard illusion |
| [emergent-abilities-mirage](notes/meta-evaluation/emergent-abilities-mirage.md) | Are emergent abilities a mirage? |
| [adding-error-bars](notes/meta-evaluation/adding-error-bars.md) | Anthropic: adding error bars to evals |
| [sabotage-evaluations](notes/meta-evaluation/sabotage-evaluations.md) | Anthropic: sabotage evaluations |
| [benchmarks-as-targets](notes/meta-evaluation/benchmarks-as-targets.md) | Benchmarks as targets |
| [data-contamination-time](notes/meta-evaluation/data-contamination-time.md) | Data contamination over time |
| [detecting-pretraining-data](notes/meta-evaluation/detecting-pretraining-data.md) | Detecting pretraining data |
| [benchmark-cheater](notes/meta-evaluation/benchmark-cheater.md) | Benchmark cheating |
| [helm-holistic-evaluation](notes/meta-evaluation/helm-holistic-evaluation.md) | Stanford HELM |
| [prediction-powered-inference](notes/meta-evaluation/prediction-powered-inference.md) | Prediction-powered inference |
| [ppi-plus-plus](notes/meta-evaluation/ppi-plus-plus.md) | PPI++ |
| [elo-uncovered](notes/meta-evaluation/elo-uncovered.md) | Elo robustness |
| [score-consistency-robustness](notes/meta-evaluation/score-consistency-robustness.md) | SCORE consistency |
| [mixeval-wisdom-of-crowd](notes/meta-evaluation/mixeval-wisdom-of-crowd.md) | MixEval mixed evaluation |
| [ranking-unraveled](notes/meta-evaluation/ranking-unraveled.md) | Rankings unraveled |
| [theory-dynamic-benchmarks](notes/meta-evaluation/theory-dynamic-benchmarks.md) | Theory of dynamic benchmarks |
| [fix-benchmarking-nlu](notes/meta-evaluation/fix-benchmarking-nlu.md) | Fixing NLU benchmarking |
| [synthetic-data-survey](notes/meta-evaluation/synthetic-data-survey.md) | Synthetic data survey |
| [lifelong-benchmarks](notes/meta-evaluation/lifelong-benchmarks.md) | Lifelong benchmarks |
| [faithful-model-evaluation](notes/meta-evaluation/faithful-model-evaluation.md) | Faithful model evaluation |

### LLM-as-Judge (37 notes)

| Note | Topic |
|------|------|
| [judging-llm-chatbot-arena](notes/llm-as-judge/judging-llm-chatbot-arena.md) | Chatbot Arena foundational work |
| [llms-as-judges-survey](notes/llm-as-judge/llms-as-judges-survey.md) | Tsinghua comprehensive survey |
| [style-over-substance](notes/llm-as-judge/style-over-substance.md) | Style-over-substance bias |
| [inconsistent-biased-evaluators](notes/llm-as-judge/inconsistent-biased-evaluators.md) | Inconsistency and bias |
| [replacing-judges-with-juries](notes/llm-as-judge/replacing-judges-with-juries.md) | Replacing judges with juries |
| [language-model-council](notes/llm-as-judge/language-model-council.md) | Language Model Council |
| [judgebench](notes/llm-as-judge/judgebench.md) | JudgeBench benchmark |
| [efficient-inference-noisy-judge](notes/llm-as-judge/efficient-inference-noisy-judge.md) | Efficient inference with a noisy judge |
| [correctly-report-llm-judge](notes/llm-as-judge/correctly-report-llm-judge.md) | How to report results correctly |
| [uncertainty-llm-judge](notes/llm-as-judge/uncertainty-llm-judge.md) | Uncertainty analysis |
| [generative-ai-paradox](notes/llm-as-judge/generative-ai-paradox.md) | The generative AI paradox |
| [red-teaming-language-models](notes/llm-as-judge/red-teaming-language-models.md) | Red Teaming |
| [allure-auditing](notes/llm-as-judge/allure-auditing.md) | ALLURE auditing |
| [chateval-multi-agent](notes/llm-as-judge/chateval-multi-agent.md) | Multi-agent debate evaluation |
| [who-validates-validators](notes/llm-as-judge/who-validates-validators.md) | Who validates the validators |
| [memalign](notes/llm-as-judge/memalign.md) | MemAlign human feedback |
| [can-llms-replace-human-evaluators](notes/llm-as-judge/can-llms-replace-human-evaluators.md) | Can LLMs replace humans? |
| [incentivizing-agentic-reasoning](notes/llm-as-judge/incentivizing-agentic-reasoning.md) | Incentivizing reasoning with RL |

### Comprehensive Studies (5 notes)

| Note | Topic |
|------|------|
| [evaluation-openai-o1](notes/comprehensive-studies/evaluation-openai-o1.md) | OpenAI o1 evaluation |
| [trustllm](notes/comprehensive-studies/trustllm.md) | TrustLLM trustworthiness |
| [evaluating-ai-uncertain-ground-truth](notes/comprehensive-studies/evaluating-ai-uncertain-ground-truth.md) | Uncertain ground truth |
| [auditing-llms-human-in-loop](notes/comprehensive-studies/auditing-llms-human-in-loop.md) | Human-in-the-loop auditing |
| [prompts-data-prioritization](notes/comprehensive-studies/prompts-data-prioritization.md) | Data prioritization |

---

## Source Distribution Statistics

| Category | Notes | Main sources |
|------|--------|---------|
| Reviews & Surveys | 10 | arXiv, ACL, EMNLP |
| Leaderboards | 16 | HuggingFace, LMSys, Berkeley |
| Eval Software | 34 | EleutherAI, Microsoft, LangChain, Arize |
| Eval Articles | 25 | Anthropic, OpenAI, Nvidia, Mozilla, HuggingFace |
| Frontier Benchmarks | 23 | OpenAI, DeepMind, Scale AI |
| Meta-Evaluation | 36 | Anthropic, Stanford, NeurIPS, ICML |
| LLM-as-Judge | 37 | Tsinghua, Cohere, Grammarly, Google |
| Comprehensive Studies | 5 | OpenAI, Google, TrustAI |
| **Syntheses** | **6** | Cross-domain synthesis |
| **Total** | **197** | |

---

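## How to Use This Knowledge Base

1. **Get a quick overview** → Read the 6 syntheses
2. **Choose tools** → Synthesis 3 (tooling landscape) + the eval-software/ directory
3. **Design an evaluation plan** → Synthesis 1 (methodology) + Synthesis 4 (benchmark design)
4. **Adopt LLM-as-Judge** → Synthesis 2 + the llm-as-judge/ directory
5. **Improve evaluation quality** → Synthesis 6 (meta-evaluation) + the meta-evaluation/ directory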

---

*Source: [github.com/alopatenko/LLMEvaluation](https://github.com/alopatenko/LLMEvaluation)*
