The Docent Agent investigates behavior across an entire collection of agent runs. It combines structured queries over your run metadata with LLM-driven analysis of individual transcripts. A typical investigation might compare scores between two model versions, pinpoint where one regressed, and read the failing transcripts to explain why. You get a report that cites every run behind each claim. Use the /docent slash command in Claude Code, Cursor, or any IDE where you’ve installed the skill. See the Quickstart to set it up.

What you can do

Here are some ways to use the Docent Agent:

Find common failure modes

Surface the recurring ways your agent fails so you know what to fix next. Give the Docent Agent a collection, define what “actionable” and “prevalent” mean for your case, and let it recursively cluster the failures until each category is specific enough to act on.
/docent Identify decisive failure modes in runs from GPT-5.1. Your goal is to identify actionable, prevalent failure modes. Actionable failure modes are ones that are specific enough that the model developer could take a clear next step to address them. Prevalent failure modes are common in the dataset.

To identify these failure modes, recursively cluster failures by using a judge to identify the key failure modes in each transcript, clustering those into common categories, and then sub-clustering the transcripts in each cluster until you reach a sufficiently actionable and prevalent insight.

- Collection ID: 479b7093-5a33-47f1-8d7b-fc9f6f16bb75
- Auto accept reading plan edits and generate a report
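The recursive clustering the prompt describes can be sketched in plain Python. This is a minimal illustration, not Docent's implementation: it assumes each transcript carries judge-assigned failure tags ordered from coarse to fine (a hypothetical `failure_tags` field), and it keeps sub-clustering any group that is still too large to act on.

```python
MIN_ACTIONABLE = 2  # hypothetical threshold: stop splitting clusters this small


def recursive_cluster(transcripts, depth=0):
    """Group transcripts by their failure tag at this depth, then
    sub-cluster any group that is still too broad to act on."""
    groups = {}
    for t in transcripts:
        tags = t["failure_tags"]  # coarse-to-fine tags from an LLM judge
        key = tags[min(depth, len(tags) - 1)]
        groups.setdefault(key, []).append(t)

    clusters = []
    for label, members in groups.items():
        node = {"label": label, "count": len(members), "children": []}
        # Recurse only if the cluster is still broad and finer tags exist.
        has_finer_tags = any(len(t["failure_tags"]) > depth + 1 for t in members)
        if len(members) > MIN_ACTIONABLE and has_finer_tags:
            node["children"] = recursive_cluster(members, depth + 1)
        clusters.append(node)
    return clusters
```

In the real workflow the judge and the stopping rule are both LLM calls; here they are stubbed as precomputed tags and a fixed size threshold so the recursion itself is visible.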
Diagnose a regression

When one model version underperforms another, pinpoint the tasks where the regression shows up, compare failed runs against successful runs on the same task, and check whether the failure modes you identify actually account for the score gap.
/docent What are the main reasons why GPT-5.1 Codex underperforms GPT-5 Codex? Identify tasks where GPT-5.1 regresses on average. On those tasks, compare a failed GPT-5.1 run against the successful GPT-5 runs. Summarize the main failure modes and analyze whether avoiding those failures was material to the result of the successful runs.

- Collection ID: 479b7093-5a33-47f1-8d7b-fc9f6f16bb75
- Auto accept reading plan and generate a report
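The first step of this workflow, finding the tasks where one model regresses on average, can be sketched as a small score comparison. This is an illustrative snippet with assumed field names (`task`, `model`, `score` on each run record), not Docent's internal query:

```python
from collections import defaultdict
from statistics import mean


def regressing_tasks(runs, baseline, candidate):
    """Return (task, gap) pairs where the candidate's mean score falls
    below the baseline's, sorted with the largest regressions first."""
    scores = defaultdict(lambda: defaultdict(list))
    for r in runs:
        scores[r["task"]][r["model"]].append(r["score"])

    gaps = []
    for task, by_model in scores.items():
        # Only compare tasks that both models attempted.
        if baseline in by_model and candidate in by_model:
            gap = mean(by_model[baseline]) - mean(by_model[candidate])
            if gap > 0:
                gaps.append((task, gap))
    return sorted(gaps, key=lambda pair: -pair[1])
```

The agent then reads a failed candidate run against the successful baseline runs on each of these tasks, which is the transcript-level comparison the prompt above asks for.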
We used this workflow to investigate why GPT-5.1 Codex underperformed GPT-5 Codex on Terminal-Bench. See the writeup for the full report.

Tips

  • Be precise about the workflow. Name the metadata fields, the comparison you want, and how to group results. The agent plans better when it knows exactly what “failure” or “regression” means in your collection.
  • The reading plan is your audit trail. Open it to trace any claim in the report back to the runs that produced it.
  • Citations only reach as far as the prompt. Each step in the reading plan can only cite items directly passed into it, not items transitively cited by earlier steps.

What’s next

Refine a behavior rubric

Turn insights from your report into a judge you can run over the whole collection.

Write DQL queries

Pull specific slices of runs and metadata directly with SQL.

Search and cluster

Find behaviors in the UI and group results automatically.

Export data

Download transcripts and metadata for local analysis.