The Docent Agent investigates behavior across an entire collection of agent runs. It combines structured queries over your run metadata with LLM-driven analysis of individual transcripts. A typical investigation might compare scores between two model versions, pinpoint where one regressed, and read the failing transcripts to explain why. You get a report that cites every run behind each claim.

Use the /docent slash command in Claude Code, Cursor, or any IDE where you’ve installed the skill. See the Quickstart to set it up.

What you can do

Here are some ways to use the Docent Agent:

Diagnose a regression

When one model version underperforms another, pinpoint the tasks where the regression shows up, compare failed runs to successful runs on the same task, and check whether the failure modes you identify are also present in successful runs.
/docent What are the main reasons why GPT-5.1 Codex underperforms GPT-5 Codex?

Identify tasks where GPT-5.1 regresses on average. On those tasks, compare a failed GPT-5.1 run against the successful GPT-5 runs. Summarize the main failure modes and analyze whether avoiding those failures was material to the result of the successful runs.

- Collection ID: 479b7093-5a33-47f1-8d7b-fc9f6f16bb75
- Auto accept reading plan and generate a report

We used this workflow to investigate why GPT-5.1 Codex underperformed GPT-5 Codex on Terminal-Bench. See the writeup for the full report.
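
The first step of this workflow, finding the tasks where one model regresses on average, can be sketched in plain Python. This is only an illustration of the comparison, not how the Docent Agent implements it: the run records, the field names (task, model, score), and the task names are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical run records. Field names and values are illustrative,
# not the schema of any real Docent collection.
runs = [
    {"task": "task-a", "model": "gpt-5-codex", "score": 1.0},
    {"task": "task-a", "model": "gpt-5-codex", "score": 1.0},
    {"task": "task-a", "model": "gpt-5.1-codex", "score": 0.0},
    {"task": "task-a", "model": "gpt-5.1-codex", "score": 1.0},
    {"task": "task-b", "model": "gpt-5-codex", "score": 0.0},
    {"task": "task-b", "model": "gpt-5.1-codex", "score": 1.0},
]


def regressed_tasks(runs, baseline, candidate):
    """Return {task: mean score delta} for tasks where candidate < baseline."""
    scores = defaultdict(lambda: defaultdict(list))
    for run in runs:
        scores[run["task"]][run["model"]].append(run["score"])

    deltas = {}
    for task, by_model in scores.items():
        # Only compare tasks that both models attempted.
        if baseline in by_model and candidate in by_model:
            delta = mean(by_model[candidate]) - mean(by_model[baseline])
            if delta < 0:
                deltas[task] = delta
    return deltas


regressions = regressed_tasks(runs, "gpt-5-codex", "gpt-5.1-codex")
print(regressions)
```

The tasks returned here are the ones where you would then compare failed and successful runs side by side.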

Surface the recurring ways your agent fails so you know what to fix next. Give the Docent Agent a collection, define what “actionable” and “prevalent” mean for your case, and let it recursively cluster the failures until each category is specific enough to act on.
/docent What are the main reasons why GPT-5.1 Codex fails?

Identify runs where GPT-5.1 failed. Summarize the primary failure modes in those runs and explain why you think they were decisive. Cluster common failure modes or failing strategies across all runs. Continue to cluster within clusters until you reach failures that are prevalent (i.e. common in the data) and specific (i.e. it is evident to a developer what a concrete fix would look like).

- Collection ID: 479b7093-5a33-47f1-8d7b-fc9f6f16bb75
- Auto accept reading plan and generate a report
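
The recursive clustering described in this prompt can be pictured with a toy sketch. It assumes each failure already carries a hypothetical list of labels ordered from coarse to fine, which stands in for the LLM judgment the Docent Agent would supply; the min_prevalence threshold is likewise an assumed definition of “prevalent”, and a cluster counts as “specific” here simply when no finer label remains.

```python
from collections import defaultdict

# Hypothetical failures: (run ID, labels ordered from coarse to fine).
failures = [
    ("run-1", ["environment", "missing dependency"]),
    ("run-2", ["environment", "missing dependency"]),
    ("run-3", ["environment", "wrong working directory"]),
    ("run-4", ["gave up early"]),
    ("run-5", ["environment", "missing dependency"]),
]


def cluster(items, depth=0, min_prevalence=2):
    """Group items by their label at `depth`, then recurse into each group.

    Groups smaller than `min_prevalence` are dropped as not prevalent.
    Recursion ends when no item in a group has a finer label left.
    """
    groups = defaultdict(list)
    for run_id, labels in items:
        if depth < len(labels):
            groups[labels[depth]].append((run_id, labels))

    tree = {}
    for label, members in groups.items():
        if len(members) < min_prevalence:
            continue
        tree[label] = {
            "runs": [run_id for run_id, _ in members],
            "children": cluster(members, depth + 1, min_prevalence),
        }
    return tree


tree = cluster(failures)
print(tree)
```

Raising min_prevalence trades coverage for focus: fewer categories survive, but each one affects more runs.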

Tips

  • Be precise about the workflow. Name the metadata fields, the comparison you want, and how to group results. The agent plans better when it knows exactly what “failure” or “regression” means in your collection.
  • Review the Reading Plan to verify claims in the report and understand how the Docent Agent operationalized your instructions.
  • Citations only reach as far as the prompt. Each step in the Reading Plan can cite only the items passed directly into it, not items cited transitively by earlier steps.
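
The citation rule in the last tip can be modeled in a few lines. This is a hypothetical model for intuition only: the Step class, its fields, and the item IDs are assumptions, not part of Docent’s API.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """Hypothetical model of one reading-plan step."""
    name: str
    inputs: set  # item IDs passed directly into this step's prompt
    cites: set = field(default_factory=set)  # item IDs the step's output cites


def out_of_scope_citations(step):
    """Return cited items that were never passed directly into the step."""
    return step.cites - step.inputs


# The summarize step reads two runs and cites one of them: in scope.
summarize = Step("summarize", inputs={"run-1", "run-2"}, cites={"run-1"})

# The report step receives only the summary, so it cannot cite run-1,
# even though the summary itself cited that run.
report = Step("report", inputs={"summary-1"}, cites={"summary-1", "run-1"})

print(out_of_scope_citations(summarize))
print(out_of_scope_citations(report))
```

If a later step needs to cite a specific run, pass that run into the step directly rather than relying on an earlier summary that mentions it.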

What’s next

Refine a behavior rubric

Turn insights from your report into a judge you can run over the whole collection.

Write DQL queries

Pull specific slices of runs and metadata directly with SQL.

Search and cluster

Find behaviors in the UI and group results automatically.

Export data

Download transcripts and metadata for local analysis.