When you use the Docent plugin, your coding agent writes a Python script that calls Docent’s analysis tools. These operations appear in the Docent UI as an Analysis Plan that you can review and approve. An Analysis Plan contains two kinds of steps:
  • A DQL step displays and executes a structured query. These steps can filter, group, or aggregate over metadata, transcripts, or prior Reading results. DQL steps are fast and deterministic.
  • A Reading step uses a language model to evaluate the results of a DQL query, which may return transcripts, metadata, prior Reading results, or text.
Analysis Plans in the UI are read-only. To revise a plan, instruct your coding agent to change it. See Best Practices for tips on generating and revising Analysis Plans.

Figure: An Analysis Plan in the Docent UI, shown as a sequence of DQL and Reading steps with their inputs, prompts, and outputs.

Creating and executing Analysis Plans

Analysis Plans appear in the UI after your coding agent writes and executes a script that calls Docent’s analysis tools. Reading steps may require your approval before they run. To approve an individual Reading step, click the Approve button in the top right corner of the step. To approve all pending steps, click the Approve All button in the top right of the page. Steps awaiting approval display in purple on the minimap. After a step has run, it displays a Results table.

Figure: The approval view for an Analysis Plan in the Docent UI, where you review each step before it runs.

Common patterns

Search and cluster

A Reading step evaluates each transcript independently for a behavior. A separate Reading step then clusters the results.

The per-transcript step applies a rubric to each transcript one at a time. For example: “Does the agent attempt to access files that don’t exist? If so, describe what it tried to access and why.” The clustering (reduce) step takes those per-transcript results and groups them: “Cluster these file-access failures by root cause.”
/docent What are the main reasons why <YOUR_MODEL> fails on <YOUR_TASK>?

Search each failed run for the primary failure mode. Then cluster common failure modes across all runs.

- Collection ID: <YOUR_COLLECTION_ID>
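The shape of this pattern is a per-transcript pass followed by a reduce over the results. The sketch below is only an illustration of that shape in plain Python. The function names (`read_transcript`, `cluster_findings`, `label_root_cause`) and the sample data are hypothetical and are not part of Docent’s API; in Docent, both judgments are made by a language model, whereas keyword rules stand in for them here.

```python
from collections import defaultdict
from typing import Callable, Optional

# Hypothetical stand-in for the per-transcript Reading step.
# A real reader is a language model applying a rubric, not a keyword match.
def read_transcript(transcript: str) -> Optional[str]:
    """Return a description of the behavior if it occurs, else None."""
    for line in transcript.splitlines():
        if "No such file or directory" in line:
            return line.strip()
    return None

# Hypothetical stand-in for the clustering (reduce) step.
def cluster_findings(
    findings: dict[str, str], label: Callable[[str], str]
) -> dict[str, list[str]]:
    """Group run IDs by the label assigned to each per-transcript finding."""
    clusters: dict[str, list[str]] = defaultdict(list)
    for run_id, finding in findings.items():
        clusters[label(finding)].append(run_id)
    return dict(clusters)

def label_root_cause(finding: str) -> str:
    """Toy labeling rule standing in for a model's root-cause judgment."""
    return "missing config" if ".yaml" in finding else "other missing file"

# Made-up transcripts for illustration only.
transcripts = {
    "run-1": "cat: settings.yaml: No such file or directory",
    "run-2": "tests passed",
    "run-3": "cat: data.csv: No such file or directory",
}

# Per-transcript pass: each transcript is read independently of the others.
findings = {}
for run_id, text in transcripts.items():
    result = read_transcript(text)
    if result is not None:
        findings[run_id] = result

# Reduce pass: only the per-transcript results are clustered.
clusters = cluster_findings(findings, label_root_cause)
```

Because the first pass treats each transcript independently, the clustering step only ever sees short per-transcript findings rather than full transcripts.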
After clustering your transcripts, you can add Reading steps that re-cluster within each category to surface more specific failures.
/docent What are the main reasons why <YOUR_MODEL> fails on <YOUR_TASK>?

Identify runs where <YOUR_MODEL> failed. Summarize the primary failure modes and explain why you think they were decisive. Cluster common failure modes across all runs. For each of the top three failure modes, re-cluster the transcripts around more specific failures. The goal is to identify failures that are prevalent (common in the data) and specific (a developer can identify a concrete fix).

- Collection ID: <YOUR_COLLECTION_ID>
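The “top three failure modes, then re-cluster” instruction in the prompt above amounts to ranking first-pass clusters by size and handing each of the largest ones to a second clustering pass. A minimal sketch of that selection follows, assuming the first pass produced a mapping from run ID to cluster label. The data and the helper name `top_clusters` are hypothetical, and nothing here calls Docent.

```python
from collections import Counter

def top_clusters(assignments: dict[str, str], k: int = 3) -> list[str]:
    """Return the k most common cluster labels, largest first."""
    counts = Counter(assignments.values())
    return [label for label, _ in counts.most_common(k)]

# Made-up first-pass output: run ID -> coarse failure mode.
assignments = {
    "run-01": "timeout", "run-02": "timeout", "run-03": "timeout", "run-04": "timeout",
    "run-05": "bad patch", "run-06": "bad patch", "run-07": "bad patch",
    "run-08": "wrong file", "run-09": "wrong file",
    "run-10": "flaky network",
}

top = top_clusters(assignments, k=3)

# Runs to hand to the second, more specific clustering pass, per coarse cluster.
runs_by_cluster = {
    label: [run_id for run_id, assigned in assignments.items() if assigned == label]
    for label in top
}
```

Restricting the second pass to the largest clusters keeps attention on failures that are prevalent, while the re-clustering itself is what makes them specific.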
Compare two models on the same tasks. A DQL step selects runs where one model regresses relative to the other, then a Reading step identifies the main differences between a failed run and a successful run on the same task.
/docent What are the main reasons why <NEW_MODEL> underperforms <OLD_MODEL>?

Identify tasks where <NEW_MODEL> regresses on average. On those tasks, compare a failed <NEW_MODEL> run against the successful <OLD_MODEL> runs. Summarize the main failure modes and analyze whether avoiding those failures was material to the result of the successful runs.

- Collection ID: <YOUR_COLLECTION_ID>
We used this workflow to investigate why GPT-5.1 Codex underperformed GPT-5 Codex on Terminal-Bench. See the writeup for the full report.
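The selection in the comparison workflow (“tasks where the new model regresses on average”) can be pictured as comparing per-task pass rates between the two models. The sketch below does this in plain Python over made-up run records. The field names (`task`, `model`, `passed`) and the helper names are assumptions for illustration, not Docent’s schema or DQL syntax.

```python
from collections import defaultdict
from statistics import mean

# Made-up run records; the field names are assumptions.
runs = [
    {"id": "a", "task": "t1", "model": "old", "passed": True},
    {"id": "b", "task": "t1", "model": "old", "passed": True},
    {"id": "c", "task": "t1", "model": "new", "passed": True},
    {"id": "d", "task": "t1", "model": "new", "passed": False},
    {"id": "e", "task": "t2", "model": "old", "passed": False},
    {"id": "f", "task": "t2", "model": "new", "passed": True},
    {"id": "g", "task": "t3", "model": "old", "passed": True},
    {"id": "h", "task": "t3", "model": "new", "passed": True},
]

def pass_rates(runs: list[dict]) -> dict[tuple[str, str], float]:
    """Average pass rate for each (task, model) pair."""
    outcomes: dict[tuple[str, str], list[float]] = defaultdict(list)
    for run in runs:
        outcomes[(run["task"], run["model"])].append(1.0 if run["passed"] else 0.0)
    return {key: mean(values) for key, values in outcomes.items()}

def regressed_tasks(runs: list[dict], new: str, old: str) -> list[str]:
    """Tasks attempted by both models where `new` has a lower pass rate."""
    rates = pass_rates(runs)
    tasks = {task for task, _ in rates}
    return sorted(
        task for task in tasks
        if (task, new) in rates and (task, old) in rates
        and rates[(task, new)] < rates[(task, old)]
    )

def comparison_sets(runs: list[dict], new: str, old: str) -> dict[str, dict[str, list[str]]]:
    """For each regressed task, the failed new-model runs and passing old-model runs."""
    result = {}
    for task in regressed_tasks(runs, new, old):
        result[task] = {
            "failed_new": [r["id"] for r in runs
                           if r["task"] == task and r["model"] == new and not r["passed"]],
            "passed_old": [r["id"] for r in runs
                           if r["task"] == task and r["model"] == old and r["passed"]],
        }
    return result

pairs = comparison_sets(runs, new="new", old="old")
```

A strictly lower pass rate guarantees that each selected task has at least one failed run from the new model and at least one successful run from the old model, so every selected task yields something to compare.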