# Benchmark Comparison Template

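Harness A:

- completion rate (tasks completed / tasks attempted)
- average retries (per task)
- bugs caught before human review (count)

Harness B:

- completion rate (tasks completed / tasks attempted)
- average retries (per task)
- bugs caught before human review (count)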
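
One way to fill in the three metrics for each harness is sketched below. It assumes each benchmark task attempt is logged as a record with `completed`, `retries`, and `bugs_caught` fields. The `RunRecord` type, the field names, and the sample records are hypothetical and are not part of this template.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunRecord:
    """One benchmark task attempt (hypothetical schema)."""
    completed: bool   # did the harness finish the task?
    retries: int      # retries used on this task
    bugs_caught: int  # bugs flagged before human review


def summarize(runs: list[RunRecord]) -> dict[str, float]:
    """Compute the three template metrics for one harness."""
    if not runs:
        raise ValueError("need at least one run")
    return {
        "completion_rate": sum(r.completed for r in runs) / len(runs),
        "average_retries": mean(r.retries for r in runs),
        "bugs_caught": sum(r.bugs_caught for r in runs),
    }


# Placeholder records for illustration only, not measured data.
harness_a = [RunRecord(True, 1, 2), RunRecord(False, 3, 0), RunRecord(True, 0, 1)]
print(summarize(harness_a))
```

Running the same `summarize` over both harnesses on the same task set keeps the two columns comparable.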

Interpretation:

- Which harness changed the result?
- Which harness changed the cost of getting the result?
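
The two questions can be separated mechanically, as a rough sketch: treat completion rate and bugs caught as result metrics and average retries as a cost metric, then report B minus A for each group. That grouping, the metric keys, and the sample numbers are illustrative assumptions, not something this template prescribes.

```python
RESULT_METRICS = ("completion_rate", "bugs_caught")  # assumed grouping
COST_METRICS = ("average_retries",)                  # assumed grouping


def compare(a: dict[str, float], b: dict[str, float]) -> dict[str, dict[str, float]]:
    """Return B minus A for each metric, grouped into result and cost."""
    return {
        "result": {m: b[m] - a[m] for m in RESULT_METRICS},
        "cost": {m: b[m] - a[m] for m in COST_METRICS},
    }


# Placeholder numbers for illustration only, not measured data.
harness_a = {"completion_rate": 0.5, "average_retries": 2.0, "bugs_caught": 3}
harness_b = {"completion_rate": 0.75, "average_retries": 1.0, "bugs_caught": 5}

deltas = compare(harness_a, harness_b)
print(deltas)
```

A nonzero delta under `result` suggests the harness changed the outcome. A delta that appears only under `cost` suggests it changed how expensive the same outcome was to reach.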
