We no longer recommend authoring behavior rubrics by hand. The Docent plugin generates Reading steps inside an Analysis Plan for you. This page is kept for users with existing rubrics.
The Data Model
A rubric defines the complete configuration for a judge. Here’s an example:prompt_templates
A list of messages that form the judge’s prompt. Templates must collectively include all three variables:{agent_run}, {rubric}, and {output_schema}. Variables can be distributed across multiple messages in the list.
| Variable | Description |
|---|---|
{agent_run} | The rendered transcript of the agent run being evaluated |
{rubric} | The rubric_text field content |
{output_schema} | JSON-formatted output schema |
{output_format_instructions} | Optional. Format-specific instructions (JSON or YAML) selected by the rubric’s output_format field. |
{output_format_instructions} is optional — include it only if you want the prompt to surface format-specific guidance to the judge.
rubric_text
The core evaluation criteria that the judge follows. This text is substituted into the{rubric} template variable in the prompt.
Write clear decision procedures that the judge can follow step-by-step. Be specific about what constitutes success or failure.
output_schema
A JSON schema defining the structure of the judge’s output.- Root must be
type: "object"withproperties additionalPropertiesis optional, but if present must befalse- Supported types:
string,integer,number,boolean,array,object(recursive objects are allowed) - Arrays require an
itemsschema; objects requireproperties - Special:
"citations": trueon string fields enables transcript references anyOf,oneOf,allOfare not supported
Labels share this meta-schema. Label sets, which define structured annotations for agent runs, use the same meta-schema validation. Any schema you define for a label set must follow these same rules.
"citations": true is set on a string field, the judge must include citations to specific parts of the transcript in its response. The Docent web UI automatically parses and links these citations. SDK users must manually convert results using JudgeResultWithCitations.from_judge_result() to resolve citation references.
judge_model
Specifies which LLM to use for evaluation. UsesModelOption with provider, model name, and optional reasoning effort.
| Provider String | Description |
|---|---|
openai | OpenAI |
anthropic | Anthropic |
google | |
openrouter | OpenRouter |
- OpenAI:
gpt-4o,gpt-4o-mini,o1,o3-mini, etc. - Anthropic:
claude-sonnet-4-5,claude-sonnet-4-20250514, etc. - Google:
gemini-2.0-flash,gemini-1.5-pro, etc. - OpenRouter: Uses a different format with the provider prefix, e.g.,
anthropic/claude-3-opus,openai/gpt-4o
reasoning_effort parameter that controls how much computation the model uses. Typical values are minimal, low, medium, and high. Not all models support this parameter—it is primarily available for OpenAI’s reasoning models (o1, o3-mini, etc.).
output_parsing_mode
Defines how the LLM output is parsed:XML_KEY(default): Extract JSON from within XML tags (e.g.,<response>...</response>). When using this mode, at least one prompt template must contain the XML tag<{response_xml_key}>(e.g.,<response>by default).CONSTRAINED_DECODING: Parse entire output as JSON (uses structured output). Supported by OpenAI, OpenRouter, and Anthropic. Not yet implemented for Google (will raiseNotImplementedError).
response_xml_key
When usingXML_KEY parsing mode, specifies the tag name to extract the response from. Defaults to "response".
Note: At least one prompt template must contain the corresponding XML tag (e.g., <answer>...</answer> if using response_xml_key="answer").
output_format
Selects the serialization format the judge is instructed to emit and that the SDK parses. Supported values are"yaml" (default for new rubrics) and "json".
{output_format_instructions} template variable, when present in a prompt template, is substituted with format-specific guidance derived from this field (for example, instructions on escaping for JSON or yaml.safe_load-compatible output for YAML). Output is still validated against output_schema regardless of format.
Rubrics created before the
output_format field existed continue to behave
as if output_format="json" was set, preserving backward compatibility.SDK Methods
create_rubric()
Upload a rubric to a collection. Returns the rubric ID.start_rubric_eval_job()
Start a rubric evaluation job for agent runs in a collection.get_rubric_run_state()
Retrieve the current rubric evaluation results and job progress. This method does not start evaluation; usestart_rubric_eval_job() first.
results, plus progress metadata such as job_id, job_status, total_results_needed, and current_results_count while a job is still running.
get_rubric()
Retrieve a rubric configuration object by ID. Optionally specify a version.get_judge()
Get a callableBaseJudge instance for running evaluations. Optionally specify a version.
list_rubrics()
List all rubrics in a collection.Running the Judge
Thebuild_judge function creates an async callable that wraps LLM providers. It takes a Rubric configuration and an LLM service, and returns a judge you can call directly on any AgentRun.
BaseLLMService reads API keys from environment variables depending on the model provider:
- OpenAI:
OPENAI_API_KEY - Anthropic:
ANTHROPIC_API_KEY - Google:
GOOGLE_API_KEY - OpenRouter:
OPENROUTER_API_KEY
JudgeResult
The judge returns aJudgeResult object with these fields:
| Field | Type | Description |
|---|---|---|
id | str | Unique identifier for this result |
agent_run_id | str | ID of the evaluated agent run |
rubric_id | str | ID of the rubric used |
rubric_version | int | Version of the rubric |
output | dict[str, Any] | Parsed output matching your output_schema |
result_metadata | dict[str, Any] | None | Additional metadata (contains errors on failure), or None |
result_type | ResultType | DIRECT_RESULT on success, FAILURE on error |

