# Run Evaluations

> Start evaluation jobs and track their progress

An evaluation job runs a rubric's judge against the agent runs in a collection.
The job runs server-side: you start it, then poll for progress.

See [Rubrics and Judges](/analysis/rubrics) for evaluation concepts.

## Start an Evaluation Job

```python
from docent import Docent

client = Docent()

job_id = client.start_rubric_eval_job(
    "my-collection-id",
    rubric_id="rubric-123",
    max_agent_runs=500,
)
print(f"Started evaluation job: {job_id}")
```

### Parameters

<ParamField body="collection_id" type="str" required>
  ID of the collection.
</ParamField>

<ParamField body="rubric_id" type="str" required>
  ID of the rubric to evaluate with.
</ParamField>

<ParamField body="max_agent_runs" type="int | None">
  Maximum number of agent runs to evaluate. If `None`, evaluates all runs in the collection.
</ParamField>

<ParamField body="n_rollouts_per_input" type="int" default="1">
  Number of independent judge rollouts per agent run. More rollouts improve reliability
  at the cost of more LLM calls.
</ParamField>

<ParamField body="max_parallel" type="int | None">
  Backend concurrency limit for the evaluation job. If `None`, uses the server default.
</ParamField>

<ParamField body="include_metadata" type="bool" default="True">
  Whether the judge prompt should include agent run metadata.
</ParamField>

### Returns

<ResponseField name="job_id" type="str">
  ID of the created (or reused) evaluation job. If an identical job is already running,
  its ID is returned instead of creating a duplicate.
</ResponseField>

***

## Get Evaluation Results

Retrieve the current state of a rubric evaluation, including results and progress.

```python
state = client.get_rubric_run_state("my-collection-id", "rubric-123")
print(f"Total results needed: {state['total_results_needed']}")
print(f"Results so far: {len(state.get('results', []))}")
```

### Parameters

<ParamField body="collection_id" type="str" required>
  ID of the collection.
</ParamField>

<ParamField body="rubric_id" type="str" required>
  ID of the rubric.
</ParamField>

<ParamField body="version" type="int | None">
  Rubric version. If `None`, uses the latest version.
</ParamField>

<ParamField body="filter_dict" type="dict | None">
  Optional filter to apply to results.
</ParamField>

<ParamField body="include_failures" type="bool" default="False">
  Whether to include failed judge results in the response.
</ParamField>

### Returns

<ResponseField name="state" type="dict">
  Evaluation state.

  <Expandable title="Fields">
    <ResponseField name="results" type="list[dict]">
      List of per-agent-run result groups. Each entry contains:

      <Expandable title="AgentRunJudgeResults fields">
        <ResponseField name="agent_run_id" type="str">The agent run that was evaluated.</ResponseField>
        <ResponseField name="rubric_id" type="str">The rubric used.</ResponseField>
        <ResponseField name="rubric_version" type="int">The rubric version used.</ResponseField>
        <ResponseField name="results" type="list[dict]">List of individual judge results, each with `output`, `result_type`, and `result_metadata`.</ResponseField>
        <ResponseField name="reflection" type="dict | None">Reflection data, if the judge variant uses multi-reflection.</ResponseField>
      </Expandable>
    </ResponseField>

    <ResponseField name="job_id" type="str | None">
      ID of the evaluation job, if one exists.
    </ResponseField>

    <ResponseField name="job_status" type="str | None">
      Status of the job: `"pending"`, `"running"`, `"completed"`, or `"canceled"`.
    </ResponseField>

    <ResponseField name="total_results_needed" type="int | None">
      Total number of results expected when evaluation is complete.
    </ResponseField>

    <ResponseField name="current_results_count" type="int | None">
      Number of results completed so far.
    </ResponseField>
  </Expandable>
</ResponseField>
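
Because `results` is nested (one group per agent run, each holding its own list of judge results), it can help to flatten it before analysis. The following is a minimal sketch: `iter_judge_outputs` is a hypothetical helper, and `sample_state` is hand-written to match the fields listed above, with placeholder values for `output`, `result_type`, and `result_metadata`.

```python
from collections import Counter
from typing import Any, Iterator


def iter_judge_outputs(state: dict[str, Any]) -> Iterator[tuple[str, Any]]:
    """Yield (agent_run_id, output) for every judge result in the state."""
    for group in state.get("results") or []:
        for judge_result in group.get("results") or []:
            yield group["agent_run_id"], judge_result.get("output")


# Hand-written sample; real `output` contents depend on the rubric.
sample_state = {
    "results": [
        {
            "agent_run_id": "run-a",
            "rubric_id": "rubric-123",
            "rubric_version": 1,
            "results": [
                {"output": "placeholder-1", "result_type": "placeholder", "result_metadata": None},
                {"output": "placeholder-2", "result_type": "placeholder", "result_metadata": None},
            ],
            "reflection": None,
        },
        {
            "agent_run_id": "run-b",
            "rubric_id": "rubric-123",
            "rubric_version": 1,
            "results": [
                {"output": "placeholder-3", "result_type": "placeholder", "result_metadata": None},
            ],
            "reflection": None,
        },
    ],
}

# Count how many judge results each agent run received.
results_per_run = Counter(run_id for run_id, _ in iter_judge_outputs(sample_state))
print(dict(results_per_run))
```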

<Note>
  `get_rubric_run_state` does **not** start an evaluation. Use `start_rubric_eval_job()`
  first, then poll `get_rubric_run_state()` to check progress.
</Note>
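
When polling, note that `job_status`, `total_results_needed`, and `current_results_count` are all typed as possibly `None`. A small sketch of how the returned state might be interpreted defensively follows; `progress_fraction` and `is_finished` are hypothetical helpers, and the sample dictionary is hand-written.

```python
from typing import Any, Optional


def progress_fraction(state: dict[str, Any]) -> Optional[float]:
    """Return completed/expected in [0, 1], or None when it is not yet known."""
    total = state.get("total_results_needed")
    done = state.get("current_results_count")
    if not total or done is None:
        return None
    return min(done / total, 1.0)


def is_finished(state: dict[str, Any]) -> bool:
    """True once the job has reached one of the documented terminal statuses."""
    return state.get("job_status") in ("completed", "canceled")


sample_state = {
    "job_id": "job-abc",
    "job_status": "running",
    "total_results_needed": 200,
    "current_results_count": 50,
}
print(progress_fraction(sample_state), is_finished(sample_state))
```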

***

## Example: Run and Monitor an Evaluation

```python
import time
from docent import Docent

client = Docent()
collection_id = "my-collection-id"
rubric_id = "rubric-123"

# Start evaluation
job_id = client.start_rubric_eval_job(collection_id, rubric_id)
print(f"Started job: {job_id}")

# Poll until the job reaches a terminal status
while True:
    state = client.get_rubric_run_state(collection_id, rubric_id)
    # Counts are typed `int | None`, so fall back to 0 when they are not set yet
    current = state.get("current_results_count") or 0
    total = state.get("total_results_needed") or 0
    print(f"Progress: {current}/{total}")

    if state.get("job_status") in ("completed", "canceled"):
        break
    time.sleep(5)

# Analyze results: each entry groups judge results by agent run
for entry in state.get("results", []):
    for judge_result in entry["results"]:
        print(f"Run {entry['agent_run_id']}: {judge_result['output']}")
```
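
The loop above runs until the job reports a terminal status. In scripts or CI it may be safer to bound the wait. The sketch below is one way to do that: `wait_for_eval` is a hypothetical helper, not an SDK function, and the default timeout and polling interval are arbitrary.

```python
import time
from typing import Any


def wait_for_eval(
    client: Any,
    collection_id: str,
    rubric_id: str,
    timeout_s: float = 600.0,
    poll_interval_s: float = 5.0,
) -> dict[str, Any]:
    """Poll `get_rubric_run_state` until the job finishes or the timeout elapses.

    Hypothetical helper. Returns the final state, or raises TimeoutError if
    the job is still not "completed" or "canceled" after `timeout_s` seconds.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        state = client.get_rubric_run_state(collection_id, rubric_id)
        if state.get("job_status") in ("completed", "canceled"):
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Evaluation for rubric {rubric_id!r} did not finish within {timeout_s}s"
            )
        time.sleep(poll_interval_s)
```

`time.monotonic()` is used for the deadline so that system clock adjustments do not shorten or extend the wait. Because a `"canceled"` job also ends the loop, callers may want to check `state["job_status"]` before treating the results as complete.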
