Use Cases
Coding agents can flexibly automate multi-step manual workflows and conduct investigations in parallel. Try:
- Comparing checkpoints: Suppose you want to answer questions like “What caused a regression between checkpoint A and checkpoint B?” or “These two checkpoints show a quantitative tradeoff on eval scores. What behaviors might explain that?” Prompting the agent to compare pairs of successful and failed runs helps identify changes between multiple versions of a model. See a detailed tutorial below.
- Triaging before deep analysis: Improve cost-efficiency by focusing your expensive classifier runs. Prompt your agent to:
  - Do a first pass with a cheap model to tag possible issues (“sycophancy,” “copyright,” “biorisk,” etc.), conservatively including potential false positives
  - Then use a more expensive model with detailed, issue-specific rubrics over only the relevant parts of the collection. For example, assign a detailed sycophancy classifier to run over the transcripts tagged “sycophancy” and a copyright classifier to run over the transcripts tagged “copyright.”
- Managing long context: Flexibly control context when analyzing collections with long traces or many transcripts. To summarize failures across a large collection, your agent can batch transcripts, extract key observations from each batch, and then cluster and look for patterns in the batched observations. Recursively passing in the results of previous analyses improves your agent’s ability to handle collection-level queries at scale; a rough sketch of this pattern follows this list.
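The recursive pattern in the last item is easy to sketch. Below, `load_transcripts` and `ask_llm` are hypothetical stand-ins for however you load the collection and call your analysis model; they are not Docent APIs.

```python
# Minimal sketch of the batch-then-cluster pattern (hypothetical helpers).
BATCH_SIZE = 20

def summarize_collection(load_transcripts, ask_llm):
    transcripts = load_transcripts()  # list[str], one entry per transcript
    batch_notes = []

    # Pass 1: extract key observations from each manageable batch.
    for i in range(0, len(transcripts), BATCH_SIZE):
        batch_text = "\n\n---\n\n".join(transcripts[i : i + BATCH_SIZE])
        batch_notes.append(ask_llm(
            "For each transcript below, note any failure modes or unusual "
            "behaviors in one or two sentences.\n\n" + batch_text
        ))

    # Pass 2: cluster the batched observations rather than the raw transcripts.
    return ask_llm(
        "Cluster the following observations into recurring patterns and "
        "summarize each cluster:\n\n" + "\n\n".join(batch_notes)
    )
```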
Quickstart
If you’ve already installed our Claude Code plugin, you can skip this section. We strongly recommend installing the plugin so that you receive automatic updates.
The steps below cover setting up from the template; an existing workspace can be configured similarly.
Step 1: Download and open the template for your preferred IDE
Make sure to open the template in its own window in your IDE so that configuration files are recognized.

The template directory contains:
- `pyproject.toml`: a minimal Python project configuration listing `docent-python` as a dependency.
- `docent.env`: contains configurable environment variables. You’ll need to set `DOCENT_API_KEY` and `DOCENT_COLLECTION_ID`.
- `.cursor/rules/docent.md` or `AGENTS.md`: instructions for how the agent should use Docent.
- `.cursor/mcp.json` or `.vscode/mcp.json`: configuration for the Docent MCP server, which provides tools to let your agent interact with Docent.
The `.cursor/` or `.vscode/` directory is often hidden in your file explorer.
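For reference, the project configuration is intentionally minimal; the template’s `pyproject.toml` looks roughly like this (exact contents may differ between template versions, and the project name below is a placeholder):

```toml
[project]
name = "docent-analysis"  # placeholder
version = "0.1.0"
dependencies = ["docent-python"]
```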
Step 2: Fill in `DOCENT_API_KEY` and `DOCENT_COLLECTION_ID` in `docent.env`
- You can generate a Docent API key at this link or navigate there from the Dashboard by clicking on Settings → API Keys.
- The `DOCENT_COLLECTION_ID` is the UUID of the collection you want to analyze. You can find it in two places:
From the collections table
Click the ID next to any collection in the Docent Dashboard to copy it.

From within a collection
When viewing a collection, the ID appears in the header next to the collection name.

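Once both values are filled in, `docent.env` is a plain environment file along these lines (placeholder values shown; substitute your own API key and collection UUID):

```bash
DOCENT_API_KEY=your-api-key-here
DOCENT_COLLECTION_ID=123e4567-e89b-12d3-a456-426614174000
```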
Step 3: Prompt the agent
For best results, describe your target workflow as precisely as possible.
Using Docent to Compare Between Checkpoints
Suppose we have two agents and we want to qualitatively understand why one is failing more often on a set of tasks. There are a few ways to approach this question. One is to summarize the cause of each individual failure, then look for high-level differences between the two agents. If we wanted to do this with Docent, we might first prompt our coding agent:

> For each run in the collection where reward = 0, compare it to another run of the same task where reward = 1 and write a short paragraph explaining the cause of the failure. Skip tasks with no successful runs. Analyze each failed run, not just one per task.

After checking the structure of the relevant metadata, the agent might write something like this:
Generated script: Compare failed runs to successful runs
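The script above is generated on the fly, so its exact contents depend on your collection’s metadata and the Docent tools available in your workspace. As a rough illustration only, with `load_runs` and `ask_llm` as hypothetical helpers (not Docent APIs) and assumed per-run metadata fields (`task_id`, `agent`, `reward`, `transcript`), such a script tends to have the following shape:

```python
from collections import defaultdict

def explain_failures(load_runs, ask_llm):
    # Hypothetical: each run is a dict with task_id, agent, reward, and transcript.
    runs = load_runs()
    by_task = defaultdict(list)
    for run in runs:
        by_task[run["task_id"]].append(run)

    explanations = []
    for task_id, task_runs in by_task.items():
        successes = [r for r in task_runs if r["reward"] == 1]
        failures = [r for r in task_runs if r["reward"] == 0]
        if not successes:
            continue  # skip tasks with no successful runs, as instructed
        reference = successes[0]
        for failed in failures:  # analyze every failed run, not just one per task
            explanations.append({
                "task_id": task_id,
                "agent": failed["agent"],
                "explanation": ask_llm(
                    "Compare the failed run to the successful run of the same task "
                    "and explain the cause of the failure in a short paragraph.\n\n"
                    f"FAILED RUN:\n{failed['transcript']}\n\n"
                    f"SUCCESSFUL RUN:\n{reference['transcript']}"
                ),
            })
    return explanations
```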
We can then follow up with a second prompt:

> Make a new script to group these results by the agent of the run being analyzed. Then look at all the results, grouped by agent, and summarize the differences in failure modes between the models.
Generated script: Summarize failure modes by model
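Again, the real generated script will differ; reusing the hypothetical `ask_llm` helper and the output of the previous step, the grouping-and-summarizing pass might look like:

```python
from collections import defaultdict

def summarize_by_agent(explanations, ask_llm):
    # `explanations` is the list produced above: dicts with "agent" and "explanation".
    grouped = defaultdict(list)
    for item in explanations:
        grouped[item["agent"]].append(item["explanation"])

    summaries = {}
    for agent, notes in grouped.items():
        summaries[agent] = ask_llm(
            f"Summarize the recurring failure modes for agent '{agent}' "
            "based on these per-run explanations:\n\n" + "\n\n".join(notes)
        )

    # A final pass contrasts the per-agent summaries directly.
    comparison = ask_llm(
        "Compare these failure-mode summaries and describe the key differences "
        "between the models:\n\n"
        + "\n\n".join(f"{agent}:\n{text}" for agent, text in summaries.items())
    )
    return summaries, comparison
```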
Best Practices & Future Work
- When selecting a model for your coding agent, Claude Sonnet 4.5 is a good place to start. Smaller models often get confused about various aspects of the analysis workflow.
- For best results, be precise when telling the coding agent what workflow you want to execute.
- The LLM can only cite items that were directly passed as part of the prompt. It cannot, for example, cite items that were cited by a result that was itself passed as part of the prompt. If the LLM gets confused about this, mention the limitation in your prompt. We’re working on a cleaner solution.
- We’re currently adding functionality to analyze Docent judge results. Stay tuned!

