Rubrics and Judges

We no longer recommend authoring behavior rubrics by hand. The Docent plugin generates Reading steps inside an Analysis Plan for you. This page is kept for users with existing rubrics.

Docent helps you create and optimize LLM-based judges that evaluate agent runs against your criteria. A rubric is a configuration object that defines how a judge works, and a judge is a callable that evaluates an AgentRun using that rubric configuration. You can create and manage judges directly in the Docent web UI. This page focuses on the underlying data model and how to use the SDK to create and run judges programmatically.

The Data Model

A rubric defines the complete configuration for a judge. Here’s an example:

from docent.judges.types import Rubric, OutputParsingMode, PromptTemplateMessage
from docent._llm_util.providers.preference_types import ModelOption

rubric = Rubric(
    prompt_templates=[
        PromptTemplateMessage(
            role="user",
            content="""
Evaluate this agent run against the rubric.

Rubric:
<rubric>
{rubric}
</rubric>

Agent run:
<agent_run>
{agent_run}
</agent_run>

Output your evaluation as JSON in <response>...</response> tags.
Schema: {output_schema}
""",
        ),
    ],
    rubric_text="""
    Evaluate whether the agent successfully completed the user's request.

    Decision procedure:
    1. Identify what the user asked for
    2. Check if the agent's final response addresses the request
    3. Verify the response is accurate and complete
    """,
    output_schema={
        "type": "object",
        "properties": {
            "label": {"type": "string", "enum": ["pass", "fail"]},
            "explanation": {"type": "string", "citations": True},
        },
        "required": ["label", "explanation"],
        "additionalProperties": False,
    },
    judge_model=ModelOption(
        provider="openai",
        model_name="gpt-4o",
    ),
    output_parsing_mode=OutputParsingMode.XML_KEY,
    response_xml_key="response",
)

Let’s break down each of the key configuration options shown above.

prompt_templates

A list of messages that form the judge’s prompt. Templates must collectively include all three variables: {agent_run}, {rubric}, and {output_schema}. Variables can be distributed across multiple messages in the list.

from docent.judges.types import PromptTemplateMessage

rubric = Rubric(
    rubric_text="...",
    output_schema={...},
    prompt_templates=[
        PromptTemplateMessage(
            role="user",
            content="""
Evaluate this agent run against the rubric.

Rubric:
<rubric>
{rubric}
</rubric>

Agent run:
<agent_run>
{agent_run}
</agent_run>

Output your evaluation as JSON in <response>...</response> tags.
Schema: {output_schema}
""",
        ),
    ],
)

Template Variables

Prompt templates must collectively include the three required variables listed below. Every variable in the table is substituted automatically:

Variable	Description
`{agent_run}`	The rendered transcript of the agent run being evaluated
`{rubric}`	The `rubric_text` field content
`{output_schema}`	JSON-formatted output schema
`{output_format_instructions}`	Optional. Format-specific instructions (JSON or YAML) selected by the rubric’s `output_format` field.

Validation will fail if any required variable is missing or if templates contain other undefined variables. {output_format_instructions} is optional — include it only if you want the prompt to surface format-specific guidance to the judge.
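
The check described above can be approximated locally before uploading a rubric. The following is a minimal sketch, not Docent's actual validator: the function name check_template_variables is hypothetical, and it assumes the variables use str.format-style braces as shown in the examples.

```python
from string import Formatter

REQUIRED_VARS = {"agent_run", "rubric", "output_schema"}
OPTIONAL_VARS = {"output_format_instructions"}


def check_template_variables(contents: list[str]) -> None:
    """Raise ValueError if a required variable is missing or an unknown one appears.

    `contents` holds the `content` string of each prompt template message.
    Hypothetical helper for illustration; not part of the Docent SDK.
    """
    found: set[str] = set()
    for content in contents:
        # Formatter.parse yields (literal_text, field_name, format_spec, conversion)
        for _, field_name, _, _ in Formatter().parse(content):
            if field_name is not None:
                found.add(field_name)

    missing = REQUIRED_VARS - found
    unknown = found - REQUIRED_VARS - OPTIONAL_VARS
    if missing:
        raise ValueError(f"missing required variables: {sorted(missing)}")
    if unknown:
        raise ValueError(f"undefined variables: {sorted(unknown)}")


# Variables may be spread across several messages, as long as all three appear.
check_template_variables([
    "Rubric:\n<rubric>\n{rubric}\n</rubric>",
    "Agent run:\n{agent_run}\nSchema: {output_schema}",
])
```

Running a check like this before calling create_rubric() surfaces a missing variable early, though the SDK's own validation remains the source of truth.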

rubric_text

The core evaluation criteria that the judge follows. This text is substituted into the {rubric} template variable in the prompt. Write clear decision procedures that the judge can follow step-by-step. Be specific about what constitutes success or failure.

output_schema

A JSON schema defining the structure of the judge’s output.

output_schema={
    "type": "object",
    "properties": {
        "label": {"type": "string", "enum": ["pass", "fail"]},
        "score": {"type": "integer", "minimum": 1, "maximum": 5},
        "explanation": {"type": "string", "citations": True},
    },
    "required": ["label", "score", "explanation"],
    "additionalProperties": False,
}

Metaschema Rules

We will post a link to the full JSON meta-schema soon. In the meantime, judge output schemas must generally follow these rules:

Root must be type: "object" with properties
additionalProperties is optional, but if present must be false
Supported types: string, integer, number, boolean, array, object (recursive objects are allowed)
Arrays require an items schema; objects require properties
Special: "citations": true on string fields enables transcript references
anyOf, oneOf, allOf are not supported

Labels share this meta-schema. Label sets, which define structured annotations for agent runs, use the same meta-schema validation. Any schema you define for a label set must follow these same rules.
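
The rules listed above can be turned into a rough local check. This is a sketch under the assumption that the list is complete; the real meta-schema may enforce more. The names schema_problems and check_output_schema are hypothetical and not part of the SDK.

```python
from typing import Any

SUPPORTED_TYPES = {"string", "integer", "number", "boolean", "array", "object"}
UNSUPPORTED_KEYWORDS = {"anyOf", "oneOf", "allOf"}


def schema_problems(schema: dict[str, Any], path: str = "$") -> list[str]:
    """Recursively collect rule violations for one schema node."""
    problems: list[str] = []

    for keyword in sorted(UNSUPPORTED_KEYWORDS.intersection(schema)):
        problems.append(f"{path}: {keyword} is not supported")

    node_type = schema.get("type")
    if node_type not in SUPPORTED_TYPES:
        problems.append(f"{path}: unsupported type {node_type!r}")

    # additionalProperties is optional, but must be false when present
    if "additionalProperties" in schema and schema["additionalProperties"] is not False:
        problems.append(f"{path}: additionalProperties must be false when present")

    if node_type == "object":
        properties = schema.get("properties")
        if not isinstance(properties, dict):
            problems.append(f"{path}: objects require properties")
        else:
            for name, child in properties.items():
                problems.extend(schema_problems(child, f"{path}.{name}"))
    elif node_type == "array":
        items = schema.get("items")
        if not isinstance(items, dict):
            problems.append(f"{path}: arrays require an items schema")
        else:
            problems.extend(schema_problems(items, f"{path}[]"))

    return problems


def check_output_schema(schema: dict[str, Any]) -> list[str]:
    """Return all violations, including the root-must-be-object rule."""
    problems = schema_problems(schema)
    if schema.get("type") != "object":
        problems.insert(0, "$: root must be type 'object'")
    return problems


# An array without an items schema is reported with its path.
for problem in check_output_schema(
    {"type": "object", "properties": {"issues": {"type": "array"}}}
):
    print(problem)
```

An empty list means the schema passed every rule listed above, which does not guarantee that Docent will accept it.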

Example with Citations

output_schema={
    "type": "object",
    "properties": {
        "issues": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string", "citations": True},
                    "severity": {"type": "string", "enum": ["low", "medium", "high"]},
                },
                "required": ["description", "severity"],
                "additionalProperties": False,
            },
        },
        "overall_score": {"type": "integer", "minimum": 1, "maximum": 10},
    },
    "required": ["issues", "overall_score"],
    "additionalProperties": False,
}

When "citations": true is set on a string field, the judge must include citations to specific parts of the transcript in its response. The Docent web UI automatically parses and links these citations. SDK users must manually convert results using JudgeResultWithCitations.from_judge_result() to resolve citation references.

judge_model

Specifies which LLM to use for evaluation. The value is a ModelOption with a provider, a model name, and an optional reasoning effort.

from docent._llm_util.providers.preference_types import ModelOption

rubric = Rubric(
    rubric_text="...",
    output_schema={...},
    judge_model=ModelOption(
        provider="openai",
        model_name="gpt-5",
        reasoning_effort="high",
    ),
)

Supported Providers

Provider String	Description
`openai`	OpenAI
`anthropic`	Anthropic
`google`	Google
`openrouter`	OpenRouter

Model Names

Any model from the supported providers can be used as a judge. Use the exact model string that the provider uses:

OpenAI: gpt-4o, gpt-4o-mini, o1, o3-mini, etc.
Anthropic: claude-sonnet-4-5, claude-sonnet-4-20250514, etc.
Google: gemini-2.0-flash, gemini-1.5-pro, etc.
OpenRouter: Uses a different format with the provider prefix, e.g., anthropic/claude-3-opus, openai/gpt-4o

Reasoning Effort

Some reasoning models support a reasoning_effort parameter that controls how much computation the model uses. Typical values are minimal, low, medium, and high. Not all models support this parameter; it is primarily available for OpenAI’s reasoning models (o1, o3-mini, etc.).

output_parsing_mode

Defines how the LLM output is parsed:

XML_KEY (default): Extract JSON from within XML tags (e.g., <response>...</response>). When using this mode, at least one prompt template must contain the XML tag <{response_xml_key}> (e.g., <response> by default).
CONSTRAINED_DECODING: Parse entire output as JSON (uses structured output). Supported by OpenAI, OpenRouter, and Anthropic. Not yet implemented for Google (will raise NotImplementedError).

from docent.judges.types import OutputParsingMode

rubric = Rubric(
    rubric_text="...",
    output_schema={...},
    output_parsing_mode=OutputParsingMode.CONSTRAINED_DECODING,
)

response_xml_key

When using XML_KEY parsing mode, specifies the tag name to extract the response from. Defaults to "response". Note: At least one prompt template must contain the corresponding XML tag (e.g., <answer>...</answer> if using response_xml_key="answer").

rubric = Rubric(
    rubric_text="...",
    output_schema={...},
    response_xml_key="answer",  # Extract from <answer>...</answer>
)
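
To make the XML_KEY mode concrete, here is a minimal sketch of pulling a JSON payload out of the configured tag. It is not Docent's parser: the function name extract_tagged_json is hypothetical, and it assumes the judge was asked for JSON output (a YAML payload would need a YAML parser instead).

```python
import json
import re
from typing import Any


def extract_tagged_json(llm_output: str, response_xml_key: str = "response") -> Any:
    """Parse the JSON found between <key> and </key> in the raw model output.

    Hypothetical helper for illustration; not part of the Docent SDK.
    """
    tag = re.escape(response_xml_key)
    # DOTALL lets the payload span multiple lines; .*? stops at the first closing tag
    match = re.search(rf"<{tag}>(.*?)</{tag}>", llm_output, flags=re.DOTALL)
    if match is None:
        raise ValueError(f"no <{response_xml_key}>...</{response_xml_key}> block found")
    return json.loads(match.group(1))


raw = 'Some reasoning first.\n<answer>\n{"label": "pass", "explanation": "ok"}\n</answer>'
parsed = extract_tagged_json(raw, response_xml_key="answer")
print(parsed["label"])  # pass
```

Text outside the tag is ignored, which is why the prompt template has to tell the judge which tag to use.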

output_format

Selects the serialization format the judge is instructed to emit and that the SDK parses. Supported values are "yaml" (default for new rubrics) and "json".

rubric = Rubric(
    rubric_text="...",
    output_schema={...},
    output_format="yaml",
)

The {output_format_instructions} template variable, when present in a prompt template, is substituted with format-specific guidance derived from this field (for example, instructions on escaping for JSON or yaml.safe_load-compatible output for YAML). Output is still validated against output_schema regardless of format.

Rubrics created before the output_format field existed continue to behave as if output_format="json" were set, preserving backward compatibility.

SDK Methods

create_rubric()

Upload a rubric to a collection. Returns the rubric ID.

rubric_id = client.create_rubric(collection_id, rubric)

start_rubric_eval_job()

Start a rubric evaluation job for agent runs in a collection.

job_id = client.start_rubric_eval_job(
    collection_id,
    rubric_id,
    max_agent_runs=100,
    n_rollouts_per_input=1,
)

Use the method below to track the job progress and retrieve the results.

get_rubric_run_state()

Retrieve the current rubric evaluation results and job progress. This method does not start evaluation; use start_rubric_eval_job() first.

import time

job_id = client.start_rubric_eval_job(collection_id, rubric_id)

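while True:
    run_state = client.get_rubric_run_state(collection_id, rubric_id)

    # job_id is None once no evaluation job is running, i.e. the job has finished
    if run_state["job_id"] is None:
        break

    time.sleep(2)  # wait before polling again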

results = run_state["results"]
print(f"Retrieved {len(results)} evaluated agent runs")

The response includes the current grouped judge results in results, plus progress metadata such as job_id, job_status, total_results_needed, and current_results_count while a job is still running.
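
The polling loop above runs until the job finishes, however long that takes. A variant with a timeout might look like the sketch below. The helper name wait_for_rubric_job and its parameters are hypothetical. It relies only on the job_id and results keys described above and takes the state fetcher as a callable, so it runs here against a stand-in.

```python
import time
from typing import Any, Callable


def wait_for_rubric_job(
    fetch_state: Callable[[], dict[str, Any]],
    poll_interval: float = 2.0,
    timeout: float = 600.0,
) -> dict[str, Any]:
    """Poll until the run state reports no active job, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    while True:
        run_state = fetch_state()
        if run_state["job_id"] is None:
            return run_state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"rubric job still running after {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in fetcher: reports a running job twice, then completion.
# With a real client this would be something like:
#   lambda: client.get_rubric_run_state(collection_id, rubric_id)
_states = iter([
    {"job_id": "job-1", "results": []},
    {"job_id": "job-1", "results": []},
    {"job_id": None, "results": ["placeholder result"]},
])
final_state = wait_for_rubric_job(lambda: next(_states), poll_interval=0.01)
print(f"Retrieved {len(final_state['results'])} evaluated agent runs")
```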

get_rubric()

Retrieve a rubric configuration object by ID. Optionally specify a version.

rubric = client.get_rubric(collection_id, rubric_id)
rubric = client.get_rubric(collection_id, rubric_id, version=2)

get_judge()

Get a callable BaseJudge instance for running evaluations. Optionally specify a version.

judge = client.get_judge(collection_id, rubric_id)

# Run the judge on an agent run (async)
result = await judge(agent_run)

list_rubrics()

List all rubrics in a collection.

rubrics = client.list_rubrics(collection_id)

Running the Judge

The build_judge function creates an async callable that wraps LLM providers. It takes a Rubric configuration and an LLM service, and returns a judge you can call directly on any AgentRun.

import asyncio
from docent._llm_util.llm_svc import BaseLLMService
from docent.judges.impl import build_judge
from docent.judges.types import Rubric, ResultType
from docent.data_models.agent_run import AgentRun
from docent.data_models.transcript import Transcript
from docent.data_models.chat.message import UserMessage, AssistantMessage

# Define the rubric
rubric = Rubric(
    rubric_text="""
    Evaluate whether the agent provided a helpful and accurate response.

    Decision procedure:
    1. Check if the agent understood the user's question
    2. Verify the response directly addresses the question
    3. Assess accuracy of any factual claims
    """,
    output_schema={
        "type": "object",
        "properties": {
            "label": {"type": "string", "enum": ["helpful", "not helpful"]},
            "explanation": {"type": "string", "citations": True},
        },
        "required": ["label", "explanation"],
        "additionalProperties": False,
    },
)

# Create the LLM service (reads API keys from environment variables)
llm_svc = BaseLLMService()

# Build the judge
judge = build_judge(rubric, llm_svc)

# Create an agent run to evaluate
agent_run = AgentRun(
    transcripts=[
        Transcript(
            messages=[
                UserMessage(content="What is the capital of France?"),
                AssistantMessage(content="The capital of France is Paris."),
            ]
        )
    ]
)

# Run the judge (async)
async def evaluate():
    result = await judge(agent_run)

    if result.result_type == ResultType.DIRECT_RESULT:
        print(f"Label: {result.output['label']}")
        print(f"Explanation: {result.output['explanation']}")
    else:
        print(f"Judge failed: {result.result_metadata}")

asyncio.run(evaluate())

The BaseLLMService reads API keys from environment variables depending on the model provider:

OpenAI: OPENAI_API_KEY
Anthropic: ANTHROPIC_API_KEY
Google: GOOGLE_API_KEY
OpenRouter: OPENROUTER_API_KEY
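
A missing key is easier to diagnose before the judge runs than after. The following sketch checks the variable for a given provider, using the mapping listed above. The helper require_api_key is hypothetical and not part of the SDK.

```python
import os

# Provider string -> environment variable, as listed above
PROVIDER_KEY_ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "google": "GOOGLE_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
}


def require_api_key(provider: str) -> str:
    """Return the env var name for `provider`, raising if it is unset or empty."""
    try:
        env_var = PROVIDER_KEY_ENV_VARS[provider]
    except KeyError:
        raise ValueError(f"unknown provider: {provider!r}") from None
    if not os.environ.get(env_var):
        raise RuntimeError(f"{env_var} is not set; export it before building the judge")
    return env_var
```

Calling require_api_key("openai") just before build_judge() fails fast with a clear message if the key is missing.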

JudgeResult

The judge returns a JudgeResult object with these fields:

Field	Type	Description
`id`	`str`	Unique identifier for this result
`agent_run_id`	`str`	ID of the evaluated agent run
`rubric_id`	`str`	ID of the rubric used
`rubric_version`	`int`	Version of the rubric
`output`	`dict[str, Any]`	Parsed output matching your `output_schema`
`result_metadata`	`dict[str, Any] | None`	Additional metadata (contains errors on failure), or `None`
`result_type`	`ResultType`	`DIRECT_RESULT` on success, `FAILURE` on error
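
When handling many results, it helps to separate successes from failures before reading output. The sketch below tallies labels using stand-in types that mirror only the fields in the table above. The classes ResultType and FakeJudgeResult defined here are placeholders (the enum values are made up), and the helper summarize is hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Any, Iterable, Optional


class ResultType(Enum):
    """Stand-in for docent.judges.types.ResultType; values are placeholders."""
    DIRECT_RESULT = "direct_result"
    FAILURE = "failure"


@dataclass
class FakeJudgeResult:
    """Stand-in carrying only the JudgeResult fields used below."""
    output: dict[str, Any]
    result_type: ResultType
    result_metadata: Optional[dict[str, Any]] = None


def summarize(results: Iterable[Any], label_key: str = "label") -> tuple[Counter, int]:
    """Count labels among successful results and count failures separately."""
    labels: Counter = Counter()
    failures = 0
    for result in results:
        if result.result_type == ResultType.DIRECT_RESULT:
            labels[result.output[label_key]] += 1
        else:
            failures += 1
    return labels, failures


sample = [
    FakeJudgeResult({"label": "pass"}, ResultType.DIRECT_RESULT),
    FakeJudgeResult({"label": "fail"}, ResultType.DIRECT_RESULT),
    FakeJudgeResult({"label": "pass"}, ResultType.DIRECT_RESULT),
    FakeJudgeResult({}, ResultType.FAILURE, {"error": "placeholder"}),
]
labels, failures = summarize(sample)
print(dict(labels), failures)  # {'pass': 2, 'fail': 1} 1
```

The label_key argument should match whichever property of your output_schema holds the verdict.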
