Use the /docent slash command in Claude Code, Cursor, or any IDE where you’ve installed the skill. See the Quickstart to set it up.
What you can do
Here are some ways to use the Docent Agent:
Find common failure modes
Surface the recurring ways your agent fails so you know what to fix next. Give the Docent Agent a collection, define what “actionable” and “prevalent” mean for your case, and let it recursively cluster the failures until each category is specific enough to act on.
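A request like this is easiest to see written out. Here is a minimal sketch of what you might type after the slash command; the collection name, score field, and thresholds are purely illustrative, and the exact invocation may differ in your setup:

```text
/docent In the collection swe-agent-eval-v2, find common failure modes.
Treat a run as a failure when metadata.score < 0.5.
"Actionable" means the category suggests a concrete prompt or scaffold fix;
"prevalent" means it shows up in at least 5% of failed runs.
Recursively split any category that is still too broad to act on.
```

Spelling out the failure criterion and both thresholds up front gives the agent a concrete stopping condition for its recursive clustering.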
Diagnose a regression
When one model version underperforms another, pinpoint the tasks where the regression shows up, compare failed runs to successful runs on the same task, and check whether the failure modes you identify actually account for the score gap.
We used this workflow to investigate why GPT-5.1 Codex underperformed GPT-5 Codex on Terminal-Bench. See the writeup for the full report.
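The same pattern applies to a regression investigation. A hypothetical prompt for the Terminal-Bench comparison above might look like the following; the collection name and metadata fields are assumptions, not real identifiers from that study:

```text
/docent In the collection terminal-bench-runs, compare
metadata.model = "gpt-5.1-codex" against metadata.model = "gpt-5-codex".
Find tasks where only the newer model fails, contrast its failed runs
with the older model's successful runs on the same tasks, and check
how much of the score gap each failure mode you identify accounts for.
```

Asking the agent to verify that the failure modes account for the score gap keeps the diagnosis tied to the metric rather than to anecdotes.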
Tips
- Be precise about the workflow. Name the metadata fields, the comparison you want, and how to group results. The agent plans better when it knows exactly what “failure” or “regression” means in your collection.
- The reading plan is your audit trail. Open it to trace any claim in the report back to the runs that produced it.
- Citations only reach as far as the prompt. Each step in the reading plan can only cite items directly passed into it, not items transitively cited by earlier steps.
What’s next
Refine a behavior rubric
Turn insights from your report into a judge you can run over the whole collection.
Write DQL queries
Pull specific slices of runs and metadata directly with SQL.
Search and cluster
Find behaviors in the UI and group results automatically.
Export data
Download transcripts and metadata for local analysis.

