Analysis Plans

With Analysis Plans, you can define, execute, and verify analysis over collections of agent traces. An Analysis Plan contains executable steps and optional markdown notes:

- A DQL step displays and executes a structured query. These steps can filter, group, or aggregate over metadata, transcripts, or prior Reading results. DQL steps are fast and deterministic.
- A Reading step uses a language model to evaluate the results of a DQL query, which may return transcripts, metadata, prior Reading results, or text.
- Markdown notes are queued in script order like other steps but are not executed. They are meant for plan context (for example, the behavior a rubric measures), usually one note at the top of the plan.

See Best Practices for tips on generating and revising Analysis Plans.

An Analysis Plan in the Docent UI: a sequence of DQL and Reading steps with their inputs, prompts, and outputs.

Common use-cases

Search and cluster

A Reading step evaluates each transcript independently for a behavior, and a separate Reading step clusters the results. The per-transcript step applies a rubric to each transcript one at a time. For example: “Does the agent attempt to access files that don’t exist? If so, describe what it tried to access and why.” The clustering (reduce) step takes those per-transcript results and groups them: “Cluster these file-access failures by root cause.”

/docent What are the main reasons why <YOUR_MODEL> fails on <YOUR_TASK>?

Search each failed run for the primary failure mode. Then cluster common failure modes across all runs.

- Collection ID: <YOUR_COLLECTION_ID>

Recursive clustering

After clustering your transcripts, create reading steps that cluster within categories to increase specificity.

/docent What are the main reasons why <YOUR_MODEL> fails on <YOUR_TASK>?

Identify runs where <YOUR_MODEL> failed. Summarize the primary failure modes and explain why you think they were decisive. Cluster common failure modes across all runs. For each of the top three failure modes, re-cluster the transcripts around more specific failures. The goal is to identify failures that are prevalent (common in the data) and specific (a developer can identify a concrete fix).

- Collection ID: <YOUR_COLLECTION_ID>

Pairwise comparison

Compare two models on the same tasks. A DQL step selects runs where one model regresses relative to the other, then a reading step identifies the main differences between a successful and a failed run on the same task.

/docent What are the main reasons why <NEW_MODEL> underperforms <OLD_MODEL>?

Identify tasks where <NEW_MODEL> regresses on average. On those tasks, compare a failed <NEW_MODEL> run against the successful <OLD_MODEL> runs. Summarize the main failure modes and analyze whether avoiding those failures was material to the result of the successful runs.

- Collection ID: <YOUR_COLLECTION_ID>

We used this workflow to investigate why GPT-5.1 Codex underperformed GPT-5 Codex on Terminal-Bench. See the writeup for the full report.
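The selection step in this workflow, finding tasks where one model regresses on average, can be sketched in plain Python. This is only an illustration of the logic, not a DQL query or Docent's implementation: the run records and the field names `task`, `model`, and `success` are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical run records; real runs would come from your collection's metadata.
runs = [
    {"task": "task-1", "model": "old", "success": 1},
    {"task": "task-1", "model": "old", "success": 1},
    {"task": "task-1", "model": "new", "success": 0},
    {"task": "task-1", "model": "new", "success": 1},
    {"task": "task-2", "model": "old", "success": 0},
    {"task": "task-2", "model": "new", "success": 1},
    {"task": "task-3", "model": "old", "success": 1},
    {"task": "task-3", "model": "new", "success": 1},
]


def regressed_tasks(runs, new_model, old_model):
    """Return tasks where new_model's mean success is below old_model's."""
    scores = defaultdict(lambda: defaultdict(list))
    for run in runs:
        scores[run["task"]][run["model"]].append(run["success"])

    regressed = []
    for task, by_model in sorted(scores.items()):
        # Only compare tasks that both models attempted.
        if new_model in by_model and old_model in by_model:
            if mean(by_model[new_model]) < mean(by_model[old_model]):
                regressed.append(task)
    return regressed


regressions = regressed_tasks(runs, new_model="new", old_model="old")
print(regressions)
```

Comparing averages rather than single runs keeps one unlucky run from marking a task as a regression.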

Analysis Plans are programs

Under the hood, an Analysis Plan is a Python script built with the Docent SDK. Your coding agent writes these scripts for you, but they’re ordinary code: you can read them, edit them, and re-run them.

Analysis Plans are lazily evaluated. Each call to client.query() or client.read() registers a step and immediately returns a lightweight handle. Handles feed into later steps, forming a dependency graph. When you ask for results, or when the script exits, the graph is submitted as an Analysis Plan. Docent executes the steps in dependency order, and the results flow back into your script as plain Python objects.

Here’s a complete search-and-cluster pipeline:

from docent import Docent

client = Docent()
collection_id = "<your-collection-id>"

# Step $1: a DQL step selecting inputs (no LLM involved)
sampled = client.query(
    collection_id,
    "SELECT transcripts.id AS transcript FROM transcripts ORDER BY transcripts.id LIMIT 100",
    name="Sample 100 transcripts",
)

# Step $2: a Reading step that runs once per row of $1
summarize = client.read(
    prompt_template=[
        sampled.transcript.as_type("transcript"),
        "Write a 1-2 sentence summary of any mistakes the agent made.",
    ],
    model="openai/gpt-5.4-mini",
    name="Summarize mistakes per transcript",
)

# Step $3: a DQL step that gathers all of $2's outputs into one row.
# The f-string interpolates the Reading handle as its alias ('$2');
# the server substitutes the real reading ID at execution time.
summaries = client.query(
    collection_id,
    f"""
    SELECT array_agg(rr.id ORDER BY rr.id) AS summaries
    FROM reading_results rr
    JOIN reading_result_links rrl ON rrl.result_id = rr.id
    WHERE rrl.reading_id = '{summarize}'
    """,
    name="Collect all summaries",
)

# Step $4: a Reading step that sees every summary at once
clusters = client.read(
    prompt_template=[
        "Cluster these mistake summaries into 5-10 categories: ",
        summaries.summaries.as_type("reading_result", is_list=True),
    ],
    model="openai/gpt-5.5",
    name="Cluster mistake summaries",
)

# Nothing has run yet. Accessing .results forces evaluation of $4 and
# everything upstream, blocks until complete, and returns the output.
print(clusters.results[0].output)
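To build intuition for the lazy evaluation used above, here is a toy model of handles and deferred execution. It is a sketch only and does not use or describe the Docent SDK's internals: the `Step` class and the `executed` list are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List

executed: List[str] = []  # records the order in which steps actually run


@dataclass
class Step:
    """A toy handle: remembers what to run and which steps it depends on."""

    name: str
    fn: Callable[..., object]
    deps: List["Step"] = field(default_factory=list)
    _result: object = None
    _done: bool = False

    @property
    def results(self):
        # Nothing runs until results are requested.
        if not self._done:
            inputs = [dep.results for dep in self.deps]  # upstream steps first
            self._result = self.fn(*inputs)
            self._done = True
            executed.append(self.name)
        return self._result


# Registering steps only builds the graph.
sample = Step("sample", lambda: ["t1", "t2", "t3"])
summarize = Step("summarize", lambda ids: [f"summary of {i}" for i in ids], [sample])
cluster = Step("cluster", lambda summaries: {"n_inputs": len(summaries)}, [summarize])
assert executed == []

# Asking for the last step's results forces it and everything upstream.
output = cluster.results
print(executed)
print(output)
```

In this toy, evaluation order falls out of the references between steps rather than the order in which they were declared.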

How DQL rows feed into readings

Every Reading step takes its inputs from a DQL query, which returns a table. Each row of that table becomes one LLM call. The columns of the table fill in the prompt.

Accessing an attribute on a query handle, like sampled.transcript in step $2, gives a reference to that column. Where the reference appears in the prompt template, each row’s value is substituted. The .as_type(...) annotation controls how the value is rendered: as_type("text") embeds the value literally, while types like "transcript" or "agent_run" treat the value as an ID and render the full object for the judge.

This means the DQL query controls both what each judge call sees and how many calls there are. Step $1 returns 100 rows, so step $2 makes 100 LLM calls, one per transcript. Step $3 uses array_agg to collapse all of step $2’s outputs into a single row with a list-valued column, so step $4 makes one LLM call that sees every summary at once. Going from “one call per item” to “one call over all items” is just a change to the query.
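The row-to-call relationship can be illustrated with a small stand-alone sketch. It mimics the idea in plain Python and is not how Docent renders prompts: `render_prompts` and the `"{column}"` placeholder convention are assumptions made for this example.

```python
from typing import Dict, List


def render_prompts(rows: List[Dict[str, object]], template: List[str]) -> List[str]:
    """Build one prompt per row; parts written as '{col}' are column references."""
    prompts = []
    for row in rows:
        parts = []
        for part in template:
            if part.startswith("{") and part.endswith("}"):
                parts.append(str(row[part[1:-1]]))  # substitute this row's value
            else:
                parts.append(part)  # literal text is copied as-is
        prompts.append("".join(parts))
    return prompts


# Three rows produce three prompts, so three model calls.
per_item_rows = [{"summary": "s1"}, {"summary": "s2"}, {"summary": "s3"}]
per_item = render_prompts(per_item_rows, ["Classify: ", "{summary}"])

# An array_agg-style collapse yields one row with a list-valued column,
# so the same template mechanism produces a single prompt over all items.
collapsed_rows = [{"summaries": [row["summary"] for row in per_item_rows]}]
all_at_once = render_prompts(collapsed_rows, ["Cluster: ", "{summaries}"])

print(len(per_item), len(all_at_once))
```

The template mechanism is identical in both cases; only the shape of the rows changes.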

Working with the graph

A few properties fall out of this design:

- Plans are built with ordinary Python. You can create steps in loops, build prompts with string formatting, and wrap common patterns in functions.
- Dependencies are tracked for you. Referencing a query’s column in a prompt ties the reading to that query. Interpolating a Reading handle into a DQL string ties the query to that reading. Docent infers the execution order from these references, so you never schedule anything yourself.
- Results are available whenever you want them. Accessing reading.results mid-script runs that step and everything upstream of it. The outputs come back as plain dicts, so you can use ordinary Python to shape later steps. For example, you can take the cluster names proposed by one reading and use them as the enum values in the next reading’s output schema.
- Re-running is cheap. Steps are content-addressed. When you re-run a script, any step whose inputs and configuration are unchanged reuses its cached results. The standard way to iterate is to append steps to the script and re-run the whole thing; only the new steps execute.
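The content-addressing idea behind cheap re-runs can be sketched as follows. This is a generic illustration, not Docent's actual caching scheme: the key layout, the `step_key` and `run_step` helpers, and the in-memory `cache` are all assumptions made for the example.

```python
import hashlib
import json
from typing import Callable, Dict, List

cache: Dict[str, object] = {}  # stands in for server-side cached results
ran: List[str] = []            # names of steps that actually executed


def step_key(kind: str, config: dict, input_keys: List[str]) -> str:
    """Derive a step's identity from its configuration and its inputs' keys."""
    payload = json.dumps(
        {"kind": kind, "config": config, "inputs": input_keys}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_step(kind: str, config: dict, input_keys: List[str],
             compute: Callable[[], object]) -> str:
    """Execute the step only if no result is cached under its key."""
    key = step_key(kind, config, input_keys)
    if key not in cache:
        ran.append(config["name"])
        cache[key] = compute()
    return key


def script(extra_step: bool) -> None:
    q = run_step("query", {"name": "sample", "sql": "SELECT 1"}, [], lambda: [1])
    r = run_step("read", {"name": "summarize", "prompt": "Summarize."}, [q], lambda: ["ok"])
    if extra_step:
        run_step("read", {"name": "cluster", "prompt": "Cluster."}, [r], lambda: ["c"])


script(extra_step=False)  # first run: both steps execute
script(extra_step=True)   # re-run with one appended step: only it executes
print(ran)
```

Because each key includes the keys of its inputs, changing an upstream step in this sketch also changes the identity of everything downstream of it.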

When the plan is submitted, its Reading steps appear in the UI for review, and the script blocks until you approve them. Call client.flush(auto_approve=True) to skip manual approval.

Creating and executing Analysis Plans

Analysis Plans appear in the UI after your coding agent writes and executes a script that calls Docent’s analysis tools. The UI view is read-only: to make revisions, instruct your coding agent to change the plan.

Reading steps may require your approval before running. To approve an individual Reading step, click the Approve button in the top right corner of the step. To approve all pending steps, click the Approve All button in the top right of the page. Steps that are waiting on your approval are shown in purple on the minimap.

The approval view for an Analysis Plan in the Docent UI, where you review each step before it runs.

Steps will display a Results table after they have run.