- Explore your data. Docent supports both fast structured queries such as “Display average reward by model” and unstructured exploration such as surfacing the primary failure modes and grouping the traces that exhibit each one.
- Quantify behavior prevalence. Measure behaviors like “sycophancy” and “reading irrelevant files” by using Docent to create reliable judges.
- Aggregate expert feedback. Use Docent to collaboratively annotate and label your traces. Use labels to inform your judges.
Get started
Explore with the Docent Agent
Use the Docent Agent to surface new behaviors. Ask for insights like “Identify the main failure modes that explain why my agent fails on Terminal-Bench” or “Display average reward by model” and receive a report of its findings.
Refine a judge
Use refinement to quantify behavior prevalence. Docent’s refinement tools turn fuzzy behaviors like “sycophancy” or “cheating” into a detailed decision procedure that an LLM judge can reliably apply.
Related pages
- Behavior rubrics: what rubrics are and how judges evaluate them.
- Structured queries (DQL): the query language the agent uses for quantitative questions.
- Search and clustering: semantic search over runs, with clustering for grouping results.
- Labeling: capture human judgments on specific runs.
- Exporting: download transcripts and metadata for local analysis.

