Tracing Hook

Activated when both MLFLOW_TRACKING_URI and MLFLOW_INSPECT_TRACING=true are set.

Maps evaluation execution to MLflow trace spans, giving users a visual debugging view of every model call, tool invocation, and scoring step.

Span Types

Span Type	Data Captured
CHAIN	Eval run, task, and sample lifecycle with scores and timing
LLM	Model name, token counts, temperature, cache status, response text
TOOL	Function name, arguments, result, working time, errors
EVALUATOR	Score value, explanation, target
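As a rough illustration of how the table above could translate into code, the sketch below maps event kinds to span type names. The event kind strings ("model", "tool", "score"), the CHAIN fallback, and the `span_type_for` helper are assumptions made for illustration; the hook's actual dispatch logic may differ.

```python
# Hypothetical mapping from an event kind to the span types listed above.
# The keys are assumed event kinds, not a confirmed part of the hook.
SPAN_TYPE_BY_EVENT = {
    "model": "LLM",
    "tool": "TOOL",
    "score": "EVALUATOR",
}


def span_type_for(event_kind: str) -> str:
    """Return the span type for an event kind, falling back to CHAIN."""
    # Assumption: anything that is not a model, tool, or score event belongs
    # to the run/task/sample lifecycle, which the table lists as CHAIN.
    return SPAN_TYPE_BY_EVENT.get(event_kind, "CHAIN")
```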

Trace Assessments

Eval scores are automatically logged as MLflow trace assessments via mlflow.log_feedback(). Scores appear in the MLflow Traces UI assessment column with the scorer name, value, and rationale.
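The sketch below shows one way a single score could be turned into a call to mlflow.log_feedback(). The helper names `feedback_kwargs` and `log_score` are hypothetical, and the mapping of scorer name, value, and explanation onto the `name`, `value`, and `rationale` arguments is an assumption based on the description above, not a copy of the hook's code.

```python
from typing import Any, Dict, Optional


def feedback_kwargs(
    trace_id: str, scorer: str, value: Any, explanation: Optional[str] = None
) -> Dict[str, Any]:
    """Build keyword arguments for mlflow.log_feedback() from one score."""
    return {
        "trace_id": trace_id,      # trace the assessment is attached to
        "name": scorer,            # shown as the assessment name
        "value": value,            # the score value
        "rationale": explanation,  # shown as the rationale
    }


def log_score(
    trace_id: str, scorer: str, value: Any, explanation: Optional[str] = None
) -> None:
    """Log one score as a trace assessment (requires MLflow to be installed)."""
    # Imported lazily so feedback_kwargs() works without MLflow installed.
    import mlflow

    mlflow.log_feedback(**feedback_kwargs(trace_id, scorer, value, explanation))
```

Keeping the argument construction separate from the MLflow call makes the mapping easy to inspect without a tracking server.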

Configuration

Env var	Required	Default	Description
`MLFLOW_INSPECT_TRACING`	Yes (in addition to MLFLOW_TRACKING_URI)	`false`	Enable execution tracing
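A minimal sketch of the activation condition, useful for checking a shell environment before starting an eval. The `tracing_requested` helper is hypothetical, and treating only the literal value "true" (case-insensitive) as enabling is an assumption; the hook may accept other spellings.

```python
import os
from typing import Mapping


def tracing_requested(env: Mapping[str, str] = os.environ) -> bool:
    """Return True when both variables needed for tracing are set."""
    has_uri = bool(env.get("MLFLOW_TRACKING_URI"))
    # Assumption: only "true" (any letter case) switches tracing on.
    flag = env.get("MLFLOW_INSPECT_TRACING", "false").strip().lower() == "true"
    return has_uri and flag
```

For example, `tracing_requested({"MLFLOW_TRACKING_URI": "http://localhost:5000", "MLFLOW_INSPECT_TRACING": "true"})` returns True, while omitting either variable returns False. The tracking URI shown is a placeholder.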

API Reference

MLflow Tracing hook for Inspect AI.

Maps evaluation execution flow to MLflow trace spans, giving users the MLflow trace UI for debugging why a particular sample scored the way it did.

Creates a span tree mirroring the eval hierarchy:

eval_run (root)
    task: math_reasoning (CHAIN)
        sample: q1 (CHAIN)
            model_call: gpt-4o (LLM) - 847 tokens, 1.2s
            tool_call: calculator (TOOL) - args: {"expr": "2+2"}, result: 4
            score: accuracy (EVALUATOR) - value: C
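The nesting in this example can be sketched with nested context managers. The code below is an illustration rather than the hook's implementation: `emit_example_tree` and `recording_start_span` are hypothetical helpers, and the recorder stands in for mlflow.start_span so the sketch runs without MLflow.

```python
from contextlib import contextmanager
from typing import List, Tuple


def emit_example_tree(start_span) -> None:
    """Open spans in the same shape as the example hierarchy."""
    with start_span(name="eval_run", span_type="CHAIN"):
        with start_span(name="task: math_reasoning", span_type="CHAIN"):
            with start_span(name="sample: q1", span_type="CHAIN"):
                # Model call, tool call, and score are siblings under the sample.
                with start_span(name="model_call: gpt-4o", span_type="LLM"):
                    pass
                with start_span(name="tool_call: calculator", span_type="TOOL"):
                    pass
                with start_span(name="score: accuracy", span_type="EVALUATOR"):
                    pass


def recording_start_span(log: List[Tuple[int, str, str]]):
    """Return a stand-in span factory that records (depth, name, span_type)."""
    depth = 0

    @contextmanager
    def start_span(name: str, span_type: str):
        nonlocal depth
        log.append((depth, name, span_type))
        depth += 1
        try:
            yield
        finally:
            depth -= 1

    return start_span
```

With MLflow installed, passing `mlflow.start_span` instead of the recorder should produce nested spans of the same shape, since that function accepts `name` and `span_type` keyword arguments and is used as a context manager.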

Activated automatically when both MLFLOW_TRACKING_URI and MLFLOW_INSPECT_TRACING="true" are set.

class inspect_mlflow.tracing.MlflowTracingHooks

MLflow Tracing Hooks.

Creates MLflow trace spans from Inspect AI evaluation events. Each eval run produces a trace with hierarchical spans for tasks, samples, model calls, tool calls, and scoring.

enabled() → bool

Check if the hook should be enabled.

Default implementation returns True.

Hooks may wish to override this to e.g. check the presence of an environment variable or a configuration setting.

Will be called frequently, so consider caching the result if the computation is expensive.
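The caching advice can be sketched as follows. `CachedEnabledExample` is a hypothetical stand-alone class, not the actual MlflowTracingHooks code, and the environment check it performs is an assumption drawn from the activation rule described on this page.

```python
import os
from typing import Optional


class CachedEnabledExample:
    """Hypothetical hook-like class that caches its enabled() result."""

    def __init__(self) -> None:
        self._enabled: Optional[bool] = None  # None means "not computed yet"

    def enabled(self) -> bool:
        # Compute once and reuse, because enabled() may be called frequently.
        if self._enabled is None:
            uri = os.environ.get("MLFLOW_TRACKING_URI", "")
            flag = os.environ.get("MLFLOW_INSPECT_TRACING", "false")
            self._enabled = bool(uri) and flag.strip().lower() == "true"
        return self._enabled
```

The trade-off is that changes to the environment after the first call are ignored for the lifetime of the instance.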

async on_run_end(data: RunEnd) → None

On run end.

Parameters: data – Run end data.

async on_run_start(data: RunStart) → None

On run start.

A "run" is a single invocation of eval() or eval_retry() which may contain many Tasks, each with many Samples and many epochs. Note that eval_retry() can be invoked multiple times within an eval_set().

Parameters: data – Run start data.

async on_sample_end(data: SampleEnd) → None

On sample end.

Called when a sample has either completed successfully, or when a sample has errored and has no retries remaining.

If a sample is run for multiple epochs, this will be called once per epoch.

Parameters: data – Sample end data.

async on_sample_event(data: SampleEvent) → None

On sample event.

Called when a sample event is emitted. Pending events are not logged here (i.e. ToolEvent and ModelEvent are not logged until they are complete).

Parameters: data – Sample event.

async on_sample_start(data: SampleStart) → None

On sample start.

Called when a sample is about to start. If the sample errors and retries, this will not be called again.

If a sample is run for multiple epochs, this will be called once per epoch.

Parameters: data – Sample start data.
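Because the sample start and end callbacks fire once per epoch, any state a hook keeps between them needs to distinguish epochs of the same sample. The sketch below illustrates that idea with a small timer keyed by a (sample id, epoch) pair. `SampleTimer` and its key shape are hypothetical, and how the identifiers are read from the SampleStart and SampleEnd payloads is deliberately left out.

```python
import time
from typing import Callable, Dict, Tuple

SampleKey = Tuple[str, int]  # (sample id, epoch); hypothetical key shape


class SampleTimer:
    """Track per-epoch sample durations between start and end callbacks."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._started: Dict[SampleKey, float] = {}

    def start(self, sample_id: str, epoch: int) -> None:
        # Would be called from a sample-start callback, once per epoch.
        self._started[(sample_id, epoch)] = self._clock()

    def end(self, sample_id: str, epoch: int) -> float:
        # Would be called from a sample-end callback; returns elapsed seconds
        # and forgets the entry so finished samples do not accumulate.
        return self._clock() - self._started.pop((sample_id, epoch))
```

Keying on the sample id alone would let a later epoch overwrite the start time of an earlier one that is still running.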

async on_task_end(data: TaskEnd) → None

On task end.

Parameters: data – Task end data.

async on_task_start(data: TaskStart) → None

On task start.

Parameters: data – Task start data.

property settings: MLflowSettings