---
title: "Harness Design for Long-Running Agentic AI Development"
date: 2026-04-01
tags:
  - auto_ingested
language: en
key_points:
  - "Introduction of a three-agent architecture: planner, generator, and evaluator."
  - "Identification of 'context anxiety' where models wrap up work prematurely as limits approach."
  - "Implementation of 'context resets' over simple compaction to provide agents with a clean slate."
  - "Use of a standalone evaluator agent to provide skeptical feedback on both subjective and verifiable tasks."
  - "Defining concrete grading criteria (Design Quality, Originality, Craft, Functionality) to quantify aesthetic quality."
  - "The use of structured handoff artifacts to carry state across multiple autonomous sessions."
  - "Techniques inspired by Generative Adversarial Networks (GANs) applied to LLM workflows."
ingested_at: 2026-04-01T05:52:40.211441+00:00
---

## Summary

Anthropic engineer Prithvi Rajasekaran discusses new architectural approaches to improve Claude's performance in frontend design and autonomous coding. The work introduces a multi-agent harness utilizing generator and evaluator agents to overcome issues like 'context anxiety' and poor self-evaluation in long-running tasks.

## Content

[[Anthropic]] engineer [[Prithvi Rajasekaran]] details the evolution of [[harness design]] for agents performing [[autonomous software engineering]] and [[frontend design skill]]. To move beyond performance plateaus, a multi-agent structure inspired by [[Generative Adversarial Networks]] (GANs) was developed. This architecture uses a [[generator]] to create work and a skeptical [[evaluator]] to grade it based on specific criteria like [[Design quality]], [[Originality]], [[Craft]], and [[Functionality]]. 

A major hurdle in long-running tasks is [[context anxiety]], where models like [[Claude Sonnet 4.5]] become less effective as the context window fills. While some use compaction, [[Anthropic]] found that full 'context resets' combined with structured handoff artifacts are more effective for giving the next agent a clean slate. The final architecture involves a three-agent loop—[[planner]], [[generator]], and [[evaluator]]—to build full-stack applications over multi-hour sessions. This approach addresses the 'self-evaluation' problem where agents typically praise their own mediocre work, by separating the creation and judgment roles into distinct agents.
