The Docent Agent investigates behavior across an entire collection of agent runs. It combines structured queries over your run metadata with LLM-driven analysis of individual transcripts. A typical investigation might compare scores between two model versions, pinpoint where one regressed, and read the failing transcripts to explain why. You get a report that cites every run behind each claim. Use the /docent slash command in Claude Code, Cursor, or any IDE where you’ve installed the skill. See the Quickstart to set it up.

What you can do

Here are some ways to use the Docent Agent:

Find common failure modes

Surface the recurring ways your agent fails so you know what to fix next. Give the Docent Agent a collection, define what “actionable” and “prevalent” mean for your case, and let it recursively cluster the failures until each category is specific enough to act on.
/docent Identify decisive failure modes in runs from GPT-5.1. Your goal is to identify actionable, prevalent failure modes. Actionable failure modes are ones that are specific enough that the model developer could take a clear next step to address them. Prevalent failure modes are common in the dataset.

To identify these failure modes, recursively cluster failures by using a judge to identify the key failure modes in each transcript, clustering those into common categories, and then sub-clustering the transcripts in each cluster until you reach a sufficiently actionable and prevalent insight.

- Collection ID: 479b7093-5a33-47f1-8d7b-fc9f6f16bb75
- Auto accept reading plan edits and generate a report
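The recursive clustering the prompt describes can be sketched in plain Python. This is a minimal illustration, not Docent's implementation: it assumes each transcript carries judge-assigned failure tags ordered from coarse to fine (a hypothetical `failure_tags` field), and it keeps sub-clustering any group that is still too large to act on.

```python
MIN_ACTIONABLE = 2  # hypothetical threshold: stop splitting clusters this small


def recursive_cluster(transcripts, depth=0):
    """Group transcripts by their failure tag at this depth, then
    sub-cluster any group that is still too broad to act on."""
    groups = {}
    for t in transcripts:
        tags = t["failure_tags"]  # coarse-to-fine tags from an LLM judge
        key = tags[min(depth, len(tags) - 1)]
        groups.setdefault(key, []).append(t)

    clusters = []
    for label, members in groups.items():
        node = {"label": label, "count": len(members), "children": []}
        # Recurse only if the cluster is still broad and finer tags exist.
        has_finer_tags = any(len(t["failure_tags"]) > depth + 1 for t in members)
        if len(members) > MIN_ACTIONABLE and has_finer_tags:
            node["children"] = recursive_cluster(members, depth + 1)
        clusters.append(node)
    return clusters
```

In the real workflow the judge and the stopping rule are both LLM calls; here they are stubbed as precomputed tags and a fixed size threshold so the recursion itself is visible.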
Diagnose a regression

When one model version underperforms another, pinpoint the tasks where the regression shows up, compare failed runs against successful runs on the same task, and check whether the failure modes you identify actually account for the score gap.
/docent What are the main reasons why GPT-5.1 Codex underperforms GPT-5 Codex? Identify tasks where GPT-5.1 regresses on average. On those tasks, compare a failed GPT-5.1 run against the successful GPT-5 runs. Summarize the main failure modes and analyze whether avoiding those failures was material to the result of the successful runs.

- Collection ID: 479b7093-5a33-47f1-8d7b-fc9f6f16bb75
- Auto accept reading plan and generate a report
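The first step of this workflow, finding the tasks where one model regresses on average, can be sketched as a small score comparison. This is an illustrative snippet with assumed field names (`task`, `model`, `score` on each run record), not Docent's internal query:

```python
from collections import defaultdict
from statistics import mean


def regressing_tasks(runs, baseline, candidate):
    """Return (task, gap) pairs where the candidate's mean score falls
    below the baseline's, sorted with the largest regressions first."""
    scores = defaultdict(lambda: defaultdict(list))
    for r in runs:
        scores[r["task"]][r["model"]].append(r["score"])

    gaps = []
    for task, by_model in scores.items():
        # Only compare tasks that both models attempted.
        if baseline in by_model and candidate in by_model:
            gap = mean(by_model[baseline]) - mean(by_model[candidate])
            if gap > 0:
                gaps.append((task, gap))
    return sorted(gaps, key=lambda pair: -pair[1])
```

The agent then reads a failed candidate run against the successful baseline runs on each of these tasks, which is the transcript-level comparison the prompt above asks for.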
We used this workflow to investigate why GPT-5.1 Codex underperformed GPT-5 Codex on Terminal-Bench. See the writeup for the full report.

Tips

  • Be precise about the workflow. Name the metadata fields, the comparison you want, and how to group results. The agent plans better when it knows exactly what “failure” or “regression” means in your collection.
  • The reading plan is your audit trail. Open it to trace any claim in the report back to the runs that produced it.
  • Citations only reach as far as the prompt. Each step in the reading plan can only cite items directly passed into it, not items transitively cited by earlier steps.

What’s next

Refine a behavior rubric

Turn insights from your report into a judge you can run over the whole collection.

Write DQL queries

Pull specific slices of runs and metadata directly with SQL.

Search and cluster

Find behaviors in the UI and group results automatically.

Export data

Download transcripts and metadata for local analysis.