Docent’s analysis tools take you from a collection of agent runs to measurable insights about agent behavior.
  • Explore your data. Docent supports fast structured queries, such as “Display average reward by model,” and unstructured exploration, such as surfacing the primary failure modes and grouping the traces that exhibit each one.
  • Quantify behavior prevalence. Measure behaviors like “sycophancy” and “reading irrelevant files” by using Docent to create reliable judges.
  • Aggregate expert feedback. Use Docent to collaboratively annotate and label your traces. Use labels to inform your judges.
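
To make the first two items concrete, here is a minimal sketch in plain Python of what a structured query like “Display average reward by model” and a prevalence measurement compute. It does not use Docent's API. The record fields (`model`, `reward`, `labels`) and the sample values are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical run records. Field names and values are illustrative
# assumptions, not Docent's actual schema or real results.
runs = [
    {"model": "model-a", "reward": 1.0, "labels": {"sycophancy"}},
    {"model": "model-a", "reward": 0.0, "labels": set()},
    {"model": "model-b", "reward": 0.5, "labels": {"sycophancy"}},
    {"model": "model-b", "reward": 1.0, "labels": set()},
]


def average_reward_by_model(runs):
    """Group runs by model and return the mean reward for each model."""
    totals = defaultdict(lambda: [0.0, 0])  # model -> [reward sum, run count]
    for run in runs:
        totals[run["model"]][0] += run["reward"]
        totals[run["model"]][1] += 1
    return {model: total / count for model, (total, count) in totals.items()}


def prevalence(runs, behavior):
    """Return the fraction of runs whose labels include the given behavior."""
    if not runs:
        return 0.0
    return sum(behavior in run["labels"] for run in runs) / len(runs)


if __name__ == "__main__":
    print(average_reward_by_model(runs))
    print(prevalence(runs, "sycophancy"))
```

In this sketch the labels are assumed to come from a judge or from human annotation; the prevalence figure is only as reliable as those labels.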

Get started

Explore with the Docent Agent

Use the Docent Agent to surface new behaviors. Ask for insights like “Identify the main failure modes that explain why my agent fails on Terminal-Bench” or “Display average reward by model” and receive a report of its findings.

Refine a judge

Use refinement to quantify behavior prevalence. Docent’s refinement tools turn fuzzy behaviors like “sycophancy” or “cheating” into detailed decision procedures that an LLM judge can reliably apply.