Evaluation jobs run a rubric’s judge against agent runs in a collection. The evaluation runs server-side — you start the job and monitor progress. See Rubrics and Judges for evaluation concepts.

Start an Evaluation Job

from docent import Docent

client = Docent()

job_id = client.start_rubric_eval_job(
    "my-collection-id",
    rubric_id="rubric-123",
    max_agent_runs=500,
)
print(f"Started evaluation job: {job_id}")

Parameters

collection_id (str, required): ID of the collection.
rubric_id (str, required): ID of the rubric to evaluate with.
max_agent_runs (int | None): Maximum number of agent runs to evaluate. If None, evaluates all runs in the collection.
n_rollouts_per_input (int, default 1): Number of independent judge rollouts per agent run. More rollouts improve reliability at the cost of more LLM calls.
max_parallel (int | None): Backend concurrency limit for the evaluation job. If None, uses the server default.
include_metadata (bool, default True): Whether the judge prompt should include agent run metadata.
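When n_rollouts_per_input is greater than 1, each agent run receives several independent judge results that you can aggregate yourself when reading results. A minimal sketch of majority-vote aggregation, assuming judge outputs are simple labels (the actual output shape depends on your rubric; this helper is not part of the SDK):

```python
from collections import Counter

def majority_label(rollout_outputs):
    """Return the most common judge output across independent rollouts.

    `rollout_outputs` is a list of judge outputs for the same agent run,
    one per rollout. Ties resolve arbitrarily, so prefer an odd rollout
    count where possible.
    """
    counts = Counter(rollout_outputs)
    return counts.most_common(1)[0][0]

print(majority_label(["match", "match", "no match"]))  # -> match
```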

Returns

job_id (str): ID of the created (or reused) evaluation job. If an identical job is already running, its ID is returned instead of creating a duplicate.

Get Evaluation Results

Retrieve the current state of a rubric evaluation, including results and progress.
state = client.get_rubric_run_state("my-collection-id", "rubric-123")
print(f"Total results needed: {state['total_results_needed']}")
print(f"Results so far: {len(state.get('results', []))}")

Parameters

collection_id (str, required): ID of the collection.
rubric_id (str, required): ID of the rubric.
version (int | None): Rubric version. If None, uses the latest version.
filter_dict (dict | None): Optional filter to apply to results.
include_failures (bool, default False): Whether to include failed judge results in the response.

Returns

state (dict): Evaluation state, including progress counts, job status, and judge results.
get_rubric_run_state does not start an evaluation. Use start_rubric_eval_job() first, then poll get_rubric_run_state() to check progress.
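The state dict can be summarized into a progress figure before deciding whether to keep polling. A minimal sketch, assuming the `current_results_count` and `total_results_needed` keys used in the examples on this page (this helper is not part of the SDK):

```python
def eval_progress(state):
    """Summarize a rubric run state dict as (done, total, fraction).

    Assumes the keys used in the polling example on this page:
    `current_results_count` and `total_results_needed`.
    """
    done = state.get("current_results_count", 0)
    total = state.get("total_results_needed", 0)
    fraction = done / total if total else 0.0
    return done, total, fraction

state = {"current_results_count": 250, "total_results_needed": 500}
print(eval_progress(state))  # -> (250, 500, 0.5)
```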

Example: Run and Monitor an Evaluation

import time
from docent import Docent

client = Docent()
collection_id = "my-collection-id"
rubric_id = "rubric-123"

# Start evaluation
job_id = client.start_rubric_eval_job(collection_id, rubric_id)
print(f"Started job: {job_id}")

# Poll for completion
while True:
    state = client.get_rubric_run_state(collection_id, rubric_id)
    current = state.get("current_results_count", 0)
    total = state.get("total_results_needed", 0)
    print(f"Progress: {current}/{total}")

    if state.get("job_status") in ("completed", "canceled"):
        break
    time.sleep(5)

# Analyze results — each entry groups judge results by agent run
for entry in state.get("results", []):
    for judge_result in entry["results"]:
        print(f"Run {entry['agent_run_id']}: {judge_result['output']}")
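The polling loop above can be wrapped in a reusable helper with a timeout, so a stalled job does not block forever. A sketch under the same assumptions as the example (terminal `job_status` values of "completed" and "canceled"); in practice you would pass your Docent client:

```python
import time

def wait_for_eval(client, collection_id, rubric_id,
                  poll_interval=5.0, timeout=600.0):
    """Poll get_rubric_run_state until the job reaches a terminal
    status or `timeout` seconds elapse. Returns the final state dict.
    """
    deadline = time.monotonic() + timeout
    while True:
        state = client.get_rubric_run_state(collection_id, rubric_id)
        if state.get("job_status") in ("completed", "canceled"):
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError("evaluation did not finish in time")
        time.sleep(poll_interval)
```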