Evaluation jobs run a rubric’s judge against agent runs in a collection. The evaluation runs server-side — you start the job and monitor progress. See Rubrics and Judges for evaluation concepts.

Start an Evaluation Job

from docent import Docent

client = Docent()

job_id = client.start_rubric_eval_job(
    "my-collection-id",
    rubric_id="rubric-123",
    max_agent_runs=500,
)
print(f"Started evaluation job: {job_id}")

Parameters

collection_id (str, required): ID of the collection.
rubric_id (str, required): ID of the rubric to evaluate with.
max_agent_runs (int | None): Maximum number of agent runs to evaluate. If None, evaluates all runs in the collection.
n_rollouts_per_input (int, default 1): Number of independent judge rollouts per agent run. More rollouts improve reliability at the cost of more LLM calls.
max_parallel (int | None): Backend concurrency limit for the evaluation job. If None, uses the server default.
include_metadata (bool, default True): Whether the judge prompt should include agent run metadata.
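When n_rollouts_per_input is greater than 1, each agent run receives several independent judge results that you can aggregate yourself when reading results. A minimal sketch of majority-vote aggregation, assuming judge outputs are simple labels (the actual output shape depends on your rubric; this helper is not part of the SDK):

```python
from collections import Counter

def majority_label(rollout_outputs):
    """Return the most common judge output across independent rollouts.

    `rollout_outputs` is a list of judge outputs for the same agent run,
    one per rollout. Ties resolve arbitrarily, so prefer an odd rollout
    count where possible.
    """
    counts = Counter(rollout_outputs)
    return counts.most_common(1)[0][0]

print(majority_label(["match", "match", "no match"]))  # -> match
```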

Returns

job_id (str): ID of the created (or reused) evaluation job. If an identical job is already running, its ID is returned instead of creating a duplicate.

Get Evaluation Results

Retrieve the current state of a rubric evaluation, including results and progress.
state = client.get_rubric_run_state("my-collection-id", "rubric-123")
print(f"Total results needed: {state['total_results_needed']}")
print(f"Results so far: {len(state.get('results', []))}")

Parameters

collection_id (str, required): ID of the collection.
rubric_id (str, required): ID of the rubric.
version (int | None): Rubric version. If None, uses the latest version.
filter_dict (dict | None): Optional filter to apply to results.
include_failures (bool, default False): Whether to include failed judge results in the response.

Returns

state (dict): Evaluation state, including progress counts, job status, and judge results.
get_rubric_run_state does not start an evaluation. Use start_rubric_eval_job() first, then poll get_rubric_run_state() to check progress.
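The state dict can be summarized into a progress figure before deciding whether to keep polling. A minimal sketch, assuming the `current_results_count` and `total_results_needed` keys used in the examples on this page (this helper is not part of the SDK):

```python
def eval_progress(state):
    """Summarize a rubric run state dict as (done, total, fraction).

    Assumes the keys used in the polling example on this page:
    `current_results_count` and `total_results_needed`.
    """
    done = state.get("current_results_count", 0)
    total = state.get("total_results_needed", 0)
    fraction = done / total if total else 0.0
    return done, total, fraction

state = {"current_results_count": 250, "total_results_needed": 500}
print(eval_progress(state))  # -> (250, 500, 0.5)
```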

Example: Run and Monitor an Evaluation

import time
from docent import Docent

client = Docent()
collection_id = "my-collection-id"
rubric_id = "rubric-123"

# Start evaluation
job_id = client.start_rubric_eval_job(collection_id, rubric_id)
print(f"Started job: {job_id}")

# Poll for completion
while True:
    state = client.get_rubric_run_state(collection_id, rubric_id)
    current = state.get("current_results_count", 0)
    total = state.get("total_results_needed", 0)
    print(f"Progress: {current}/{total}")

    if state.get("job_status") in ("completed", "canceled"):
        break
    time.sleep(5)

# Analyze results — each entry groups judge results by agent run
for entry in state.get("results", []):
    for judge_result in entry["results"]:
        print(f"Run {entry['agent_run_id']}: {judge_result['output']}")
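The polling loop above can be wrapped in a reusable helper with a timeout, so a stalled job does not block forever. A sketch under the same assumptions as the example (terminal `job_status` values of "completed" and "canceled"); in practice you would pass your Docent client:

```python
import time

def wait_for_eval(client, collection_id, rubric_id,
                  poll_interval=5.0, timeout=600.0):
    """Poll get_rubric_run_state until the job reaches a terminal
    status or `timeout` seconds elapse. Returns the final state dict.
    """
    deadline = time.monotonic() + timeout
    while True:
        state = client.get_rubric_run_state(collection_id, rubric_id)
        if state.get("job_status") in ("completed", "canceled"):
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError("evaluation did not finish in time")
        time.sleep(poll_interval)
```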