Before you start

We generally recommend using the Docent Agent to ingest your traces. The /docent skill writes the SDK script for you from your existing logs. Use this page if you want to debug what /docent produced, have unusual data formats the Docent Agent can’t infer, or need fine-grained control. If you already have an Inspect .eval file, the fastest path is drag-and-drop upload. Otherwise, follow the steps below.

Setup

Install the SDK:
pip install docent-python
Go to the API keys page, create a key, and instantiate a client object with that key:
import os
from docent import Docent

client = Docent(
    api_key=os.getenv("DOCENT_API_KEY"),  # this is the default and can be omitted

    # Uncomment and adjust these if you're self-hosting
    # server_url="http://localhost:8889",
    # web_url="http://localhost:3001",
)

Create a collection

collection_id = client.create_collection(
    name="sample collection",
    description="example that comes with the Docent repo",
)

Convert your data

The end-to-end example below walks through the full flow; adapt it to match your data.
If your messages are already in OpenAI chat format ({"role": ..., "content": ..., "tool_calls": ...}), use parse_chat_message to convert each one into a ChatMessage. The example below uses this helper.
Say we have three simple agent runs.
transcript_1 = [
    {
        "role": "user",
        "content": "What's the weather like in New York today?"
    },
    {
        "role": "assistant",
        "content": "The weather in New York today is mostly sunny with a high of 75°F (24°C)."
    }
]
metadata_1 = {"model": "gpt-3.5-turbo", "agent_scaffold": "foo", "hallucinated": True}
transcript_2 = [
    {
        "role": "user",
        "content": "What's the weather like in San Francisco today?"
    },
    {
        "role": "assistant",
        "content": "The weather in San Francisco today is mostly cloudy with a high of 65°F (18°C)."
    }
]
metadata_2 = {"model": "gpt-3.5-turbo", "agent_scaffold": "foo", "hallucinated": True}
transcript_3 = [
    {
        "role": "user",
        "content": "What's the weather like in Paris today?"
    },
    {
        "role": "assistant",
        "content": "I'm sorry, I don't know because I don't have access to weather tools."
    }
]
metadata_3 = {"model": "gpt-3.5-turbo", "agent_scaffold": "bar", "hallucinated": False}

transcripts = [transcript_1, transcript_2, transcript_3]
metadata = [metadata_1, metadata_2, metadata_3]
Each input must become an AgentRun object, which holds Transcript objects whose messages are ChatMessage instances. We could construct the messages manually, but it's easier to use parse_chat_message, since the raw dicts already conform to the expected schema.
from docent.data_models.chat import parse_chat_message
from docent.data_models import Transcript

parsed_transcripts = [
    Transcript(messages=[parse_chat_message(msg) for msg in transcript])
    for transcript in transcripts
]
Now we can create the AgentRun objects.
from docent.data_models import AgentRun

agent_runs = [
    AgentRun(
        transcripts=[t],
        metadata={
            "model": m["model"],
            "agent_scaffold": m["agent_scaffold"],
            "scores": {"hallucinated": m["hallucinated"]},
        }
    )
    for t, m in zip(parsed_transcripts, metadata)
]

Upload the runs

client.add_agent_runs(collection_id, agent_runs)
If you navigate to the frontend URL printed by client.create_collection(...), you should see the three runs available for viewing.

Tips and tricks

Including sufficient context

Docent can only catch issues that are evident from the context it has about your evaluation. For example:
  • If you’re looking to catch issues with solution labels, you should provide the exact label in the metadata, not just the agent’s score.
  • For software engineering tasks, if you want to know why agents failed, you should include information about what tests were run and their traceback/execution logs.
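As a sketch, the metadata dict attached to an AgentRun can carry this extra context directly. The keys below (other than the "scores" convention shown earlier) are illustrative names, not a required schema:

```python
# Illustrative metadata for one software-engineering run. Metadata is
# free-form, so every key besides "scores" is an example name, not a
# schema Docent requires.
run_metadata = {
    "model": "gpt-3.5-turbo",
    "agent_scaffold": "bar",
    # Provide the exact expected label, not just the agent's score
    "expected_label": "65°F, mostly cloudy",
    # Record which tests ran and their raw output, so failures are explainable
    "tests_run": ["test_weather_lookup"],
    "test_logs": "FAILED test_weather_lookup: AssertionError: no tool call made",
    "scores": {"hallucinated": False},
}
```

With context like this, a search such as "runs where the agent's answer contradicts the expected label" has the ground truth it needs to find real mismatches.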