Analysis Overview


Docent’s analysis tools take you from a collection of agent runs to measurable insights about agent behavior.

Explore your data. Docent supports fast structured queries, such as “Display average reward by model”, and unstructured exploration, such as surfacing the primary failure modes and grouping the traces that exhibit each one.
Quantify behavior prevalence. Measure behaviors like “sycophancy” and “reading irrelevant files” by using Docent to create reliable judges.
Aggregate expert feedback. Use Docent to collaboratively annotate and label your traces. Use labels to inform your judges.
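
To make the structured-query example concrete, the sketch below computes "average reward by model" in plain Python. It does not use Docent's API or its query language; the record fields (`model`, `reward`) and the sample values are hypothetical, chosen only to show what the aggregation means.

```python
from collections import defaultdict

# Hypothetical run records; field names and values are illustrative only.
runs = [
    {"model": "model-a", "reward": 1.0},
    {"model": "model-a", "reward": 0.0},
    {"model": "model-b", "reward": 0.5},
]


def average_reward_by_model(records):
    """Group records by model and return the mean reward for each model."""
    totals = defaultdict(lambda: [0.0, 0])  # model -> [reward sum, run count]
    for record in records:
        totals[record["model"]][0] += record["reward"]
        totals[record["model"]][1] += 1
    return {model: total / count for model, (total, count) in totals.items()}


averages = average_reward_by_model(runs)
print(averages)
```

In practice the same question would be posed to Docent directly; the sketch only spells out the group-and-average step behind it.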

Get started

Explore with the Docent Agent

Use the Docent Agent to surface new behaviors. Ask for insights like “Identify the main failure modes that explain why my agent fails on Terminal-Bench” or “Display average reward by model” and receive a report of its findings.

Refine a judge

Use refinement to quantify behavior prevalence. Docent’s refinement tools turn fuzzy behaviors like “sycophancy” or “cheating” into a detailed decision procedure that an LLM judge can reliably apply.
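
As a rough illustration of what "prevalence" means once a judge has produced a verdict for each run, the sketch below computes the fraction of runs flagged for a behavior. It is plain Python, not Docent's API; the verdict list is hypothetical and assumes one boolean verdict per run.

```python
def prevalence(verdicts):
    """Return the fraction of runs that a judge flagged as showing the behavior."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    return sum(1 for verdict in verdicts if verdict) / len(verdicts)


# Hypothetical judge output: True means the judge found the behavior in that run.
sycophancy_verdicts = [True, False, False, True, False]
rate = prevalence(sycophancy_verdicts)
print(f"{rate:.0%} of runs flagged")
```

The number is only as trustworthy as the judge behind it, which is why the decision procedure is refined before the behavior is measured.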

Related pages

Behavior rubrics: what rubrics are and how judges evaluate them.
Structured queries (DQL): the query language the agent uses for quantitative questions.
Search and clustering: semantic search over runs, with clustering for grouping results.
Labeling: capture human judgments on specific runs.
Exporting: download transcripts and metadata for local analysis.
