Tracing Hook

Activated when both MLFLOW_TRACKING_URI and MLFLOW_INSPECT_TRACING=true are set.

Maps evaluation execution to MLflow trace spans, giving users a visual debugging view of every model call, tool invocation, and scoring step.

Span Types

Span Type    Data Captured
CHAIN        Eval run, task, and sample lifecycle with scores and timing
LLM          Model name, token counts, temperature, cache status, response text
TOOL         Function name, arguments, result, working time, errors
EVALUATOR    Score value, explanation, target

Trace Assessments

Eval scores are automatically logged as MLflow trace assessments via mlflow.log_feedback(). Scores appear in the MLflow Traces UI assessment column with the scorer name, value, and rationale.
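The mapping from an eval score to a trace assessment can be sketched as follows. This is a minimal illustration, not the hook's actual code: ExampleScore, feedback_kwargs, log_score, and the trace ID "tr-123" are hypothetical names introduced here, and the keyword arguments assume the mlflow.log_feedback() signature from MLflow 3.x (trace_id, name, value, rationale).

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ExampleScore:
    """Hypothetical stand-in for an eval score (not the Inspect AI Score type)."""
    scorer: str
    value: Any
    explanation: str


def feedback_kwargs(trace_id: str, score: ExampleScore) -> dict:
    """Build the keyword arguments for one mlflow.log_feedback() call."""
    return {
        "trace_id": trace_id,            # trace the assessment attaches to
        "name": score.scorer,            # shown as the scorer name in the UI
        "value": score.value,            # the score value
        "rationale": score.explanation,  # shown as the rationale
    }


def log_score(trace_id: str, score: ExampleScore) -> None:
    """Send the assessment to MLflow (needs mlflow installed and a real trace)."""
    import mlflow  # imported lazily so the sketch loads without MLflow

    mlflow.log_feedback(**feedback_kwargs(trace_id, score))


example = feedback_kwargs(
    "tr-123", ExampleScore("accuracy", "C", "Answer matches the target.")
)
print(example)
```

log_score() is deliberately not called here, because it requires a reachable tracking server and an existing trace.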

Configuration

Env var                   Required                                    Default    Description
MLFLOW_INSPECT_TRACING    Yes (in addition to MLFLOW_TRACKING_URI)    false      Enable execution tracing

API Reference

MLflow Tracing hook for Inspect AI.

Maps evaluation execution flow to MLflow trace spans, giving users the MLflow trace UI for debugging why a particular sample scored the way it did.

Creates a span tree mirroring the eval hierarchy:

eval_run (root)
  task: math_reasoning (CHAIN)
    sample: q1 (CHAIN)
      model_call: gpt-4o (LLM) - 847 tokens, 1.2s
      tool_call: calculator (TOOL) - args: {"expr": "2+2"}, result: 4
      score: accuracy (EVALUATOR) - value: C
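The hierarchy can be modeled as a small tree of nodes. The sketch below is only an illustration of the nesting: SpanNode and render are hypothetical helpers introduced here, not part of inspect_mlflow or MLflow.

```python
from dataclasses import dataclass, field


@dataclass
class SpanNode:
    """Hypothetical in-memory span: a name, a span type, and child spans."""
    name: str
    span_type: str = ""
    children: list = field(default_factory=list)

    def add(self, child: "SpanNode") -> "SpanNode":
        """Attach a child span and return it so calls can be chained."""
        self.children.append(child)
        return child


def render(node: SpanNode, depth: int = 0) -> list:
    """Return one indented line per span, parents before children."""
    label = f"{node.name} ({node.span_type})" if node.span_type else node.name
    lines = ["  " * depth + label]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


root = SpanNode("eval_run")
task = root.add(SpanNode("task: math_reasoning", "CHAIN"))
sample = task.add(SpanNode("sample: q1", "CHAIN"))
sample.add(SpanNode("model_call: gpt-4o", "LLM"))
sample.add(SpanNode("tool_call: calculator", "TOOL"))
sample.add(SpanNode("score: accuracy", "EVALUATOR"))

print("\n".join(render(root)))
```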

Activated automatically when both MLFLOW_TRACKING_URI and MLFLOW_INSPECT_TRACING="true" are set.

class inspect_mlflow.tracing.MlflowTracingHooks

MLflow Tracing Hooks.

Creates MLflow trace spans from Inspect AI evaluation events. Each eval run produces a trace with hierarchical spans for tasks, samples, model calls, tool calls, and scoring.

enabled() -> bool

Check if the hook should be enabled.

Default implementation returns True.

Hooks may wish to override this to e.g. check the presence of an environment variable or a configuration setting.

Will be called frequently, so consider caching the result if the computation is expensive.

async on_run_end(data: RunEnd) -> None

On run end.

Parameters:

data – Run end data.

async on_run_start(data: RunStart) -> None

On run start.

A “run” is a single invocation of eval() or eval_retry() which may contain many Tasks, each with many Samples and many epochs. Note that eval_retry() can be invoked multiple times within an eval_set().

Parameters:

data – Run start data.

async on_sample_end(data: SampleEnd) -> None

On sample end.

Called when a sample has either completed successfully, or when a sample has errored and has no retries remaining.

If a sample is run for multiple epochs, this will be called once per epoch.

Parameters:

data – Sample end data.

async on_sample_event(data: SampleEvent) -> None

On sample event.

Called when a sample event is emitted. Pending events are not logged here (i.e. ToolEvent and ModelEvent are not logged until they are complete).

Parameters:

data – Sample event.
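A handler of this kind typically decides which span type a completed event maps to. The sketch below illustrates that dispatch with plain dictionaries; the event kind strings ("model", "tool", "score"), the "kind" key, and the span_type_for helper are assumptions made for illustration and may not match how the hook inspects real SampleEvent data.

```python
from typing import Optional

# Assumed mapping from an event kind to the span types listed in this document.
SPAN_TYPE_BY_EVENT = {
    "model": "LLM",
    "tool": "TOOL",
    "score": "EVALUATOR",
}


def span_type_for(event: dict) -> Optional[str]:
    """Return the span type for a completed event, or None if it is not traced."""
    return SPAN_TYPE_BY_EVENT.get(event.get("kind", ""))


events = [
    {"kind": "model", "name": "gpt-4o"},
    {"kind": "tool", "name": "calculator"},
    {"kind": "score", "name": "accuracy"},
    {"kind": "info", "name": "note"},  # an event kind with no span mapping
]
span_types = [span_type_for(e) for e in events]
print(span_types)
```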

async on_sample_start(data: SampleStart) -> None

On sample start.

Called when a sample is about to start. If the sample errors and retries, this will not be called again.

If a sample is run for multiple epochs, this will be called once per epoch.

Parameters:

data – Sample start data.

async on_task_end(data: TaskEnd) -> None

On task end.

Parameters:

data – Task end data.

async on_task_start(data: TaskStart) -> None

On task start.

Parameters:

data – Task start data.

property settings: MLflowSettings