Tracing Hook
The tracing hook is activated when MLFLOW_TRACKING_URI is set and MLFLOW_INSPECT_TRACING is set to true.
It maps evaluation execution to MLflow trace spans, giving users a visual debugging view of every model call, tool invocation, and scoring step.
Span Types
| Span Type | Data Captured |
|---|---|
| CHAIN | Eval run, task, and sample lifecycle with scores and timing |
| LLM | Model name, token counts, temperature, cache status, response text |
| TOOL | Function name, arguments, result, working time, errors |
| EVALUATOR | Score value, explanation, target |
Trace Assessments
Eval scores are automatically logged as MLflow trace assessments via mlflow.log_feedback().
Scores appear in the MLflow Traces UI assessment column with the scorer name, value, and rationale.
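As a rough illustration of the mapping described above, the sketch below builds the keyword arguments that could be passed to mlflow.log_feedback() for one score. The helper name `score_to_feedback_kwargs` and the example trace ID are hypothetical, and the hook's actual field mapping may differ; the keyword names (`trace_id`, `name`, `value`, `rationale`) should be checked against the installed MLflow version.

```python
from typing import Any, Dict, Optional


def score_to_feedback_kwargs(
    trace_id: str, scorer: str, value: Any, explanation: Optional[str]
) -> Dict[str, Any]:
    """Build keyword arguments for mlflow.log_feedback() from one eval score.

    Hypothetical helper: the hook's real field mapping is not shown in these docs.
    """
    kwargs: Dict[str, Any] = {"trace_id": trace_id, "name": scorer, "value": value}
    if explanation:
        # Only attach a rationale when the scorer produced an explanation.
        kwargs["rationale"] = explanation
    return kwargs


# "tr-123" is a made-up trace ID used purely for illustration.
kwargs = score_to_feedback_kwargs("tr-123", "accuracy", "C", "Answer matches target")
print(kwargs)
# With MLflow installed and a real trace ID, the score could then be logged with:
#   mlflow.log_feedback(**kwargs)
```

Keeping the mapping in a plain function makes it easy to inspect what would be sent to MLflow without a tracking server.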
Configuration
| Env var | Required | Default | Description |
|---|---|---|---|
| MLFLOW_INSPECT_TRACING | Yes (in addition to MLFLOW_TRACKING_URI) | | Enable execution tracing |
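When tracing does not appear, it can help to check the two variables directly. The following is a minimal sketch, not part of the package: `missing_tracing_config` is a hypothetical helper, the case-insensitive comparison against "true" is an assumption, and the tracking URI is a placeholder for a local server.

```python
import os
from typing import List, Mapping


def missing_tracing_config(env: Mapping[str, str]) -> List[str]:
    """Return the problems that would keep the tracing hook inactive."""
    problems = []
    if not env.get("MLFLOW_TRACKING_URI"):
        problems.append("MLFLOW_TRACKING_URI is not set")
    # Assumption: the flag is compared case-insensitively against "true".
    if env.get("MLFLOW_INSPECT_TRACING", "").strip().lower() != "true":
        problems.append("MLFLOW_INSPECT_TRACING is not 'true'")
    return problems


# Hypothetical local tracking server; substitute your own URI.
example_env = {
    "MLFLOW_TRACKING_URI": "http://localhost:5000",
    "MLFLOW_INSPECT_TRACING": "true",
}
print(missing_tracing_config(example_env))  # []
# Check the real process environment the same way.
print(missing_tracing_config(os.environ))
```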
API Reference
MLflow Tracing hook for Inspect AI.
Maps evaluation execution flow to MLflow trace spans, giving users the MLflow trace UI for debugging why a particular sample scored the way it did.
Creates a span tree mirroring the eval hierarchy:
- eval_run (root)
  - task: math_reasoning (CHAIN)
    - sample: q1 (CHAIN)
      - model_call: gpt-4o (LLM) - 847 tokens, 1.2s
      - tool_call: calculator (TOOL) - args: {"expr": "2+2"}, result: 4
      - score: accuracy (EVALUATOR) - value: C

Activated automatically when both MLFLOW_TRACKING_URI and MLFLOW_INSPECT_TRACING="true" are set.
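The span hierarchy above can be modelled with a small tree structure. This is a sketch using only the standard library: the `Span` dataclass and `render` function are hypothetical stand-ins for illustration and do not create real MLflow spans.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    """Stand-in for a trace span: a name, a span type, and child spans."""

    name: str
    span_type: str
    children: List["Span"] = field(default_factory=list)

    def add(self, child: "Span") -> "Span":
        """Attach a child span and return it so calls can be chained."""
        self.children.append(child)
        return child


def render(span: Span, depth: int = 0) -> List[str]:
    """Flatten the tree into indented lines, parents before children."""
    lines = [f"{'  ' * depth}{span.name} ({span.span_type})"]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


# Rebuild the example hierarchy: run -> task -> sample -> leaf spans.
root = Span("eval_run", "CHAIN")
task = root.add(Span("task: math_reasoning", "CHAIN"))
sample = task.add(Span("sample: q1", "CHAIN"))
sample.add(Span("model_call: gpt-4o", "LLM"))
sample.add(Span("tool_call: calculator", "TOOL"))
sample.add(Span("score: accuracy", "EVALUATOR"))

print("\n".join(render(root)))
```

Each leaf sits under the sample that produced it, which is what lets the trace UI show why a given sample received its score.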
- class inspect_mlflow.tracing.MlflowTracingHooks
MLflow Tracing Hooks.
Creates MLflow trace spans from Inspect AI evaluation events. Each eval run produces a trace with hierarchical spans for tasks, samples, model calls, tool calls, and scoring.
- enabled() → bool
Check if the hook should be enabled.
Default implementation returns True.
Hooks may wish to override this to e.g. check the presence of an environment variable or a configuration setting.
Will be called frequently, so consider caching the result if the computation is expensive.
- async on_run_start(data: RunStart) → None
On run start.
A “run” is a single invocation of eval() or eval_retry() which may contain many Tasks, each with many Samples and many epochs. Note that eval_retry() can be invoked multiple times within an eval_set().
- Parameters:
data – Run start data.
- async on_sample_end(data: SampleEnd) → None
On sample end.
Called when a sample has either completed successfully, or when a sample has errored and has no retries remaining.
If a sample is run for multiple epochs, this will be called once per epoch.
- Parameters:
data – Sample end data.
- async on_sample_event(data: SampleEvent) → None
On sample event.
Called when a sample event is emitted. Pending events are not logged here (i.e. ToolEvent and ModelEvent are not logged until they are complete).
- Parameters:
data – Sample event.
- async on_sample_start(data: SampleStart) → None
On sample start.
Called when a sample is about to start. If the sample errors and retries, this will not be called again.
If a sample is run for multiple epochs, this will be called once per epoch.
- Parameters:
data – Sample start data.
- property settings: MLflowSettings
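To make the callback order described in this reference concrete, here is a small simulation. It is a sketch under stated assumptions: `RecordingHooks`, `FakeSampleData`, and `simulate` are hypothetical stand-ins that do not use Inspect AI or MLflow, and the real RunStart, SampleStart, SampleEvent, and SampleEnd payloads carry more fields. It also shows one way to cache the environment check in enabled(), since that method is called frequently.

```python
import asyncio
import os
from dataclasses import dataclass
from typing import List, Optional


# Hypothetical stand-in for Inspect AI's sample payloads.
@dataclass
class FakeSampleData:
    sample_id: str
    epoch: int


class RecordingHooks:
    """Records callback order instead of creating MLflow spans."""

    def __init__(self) -> None:
        self.calls: List[str] = []
        self._enabled: Optional[bool] = None

    def enabled(self) -> bool:
        # Cache the environment check, since enabled() is called frequently.
        if self._enabled is None:
            self._enabled = bool(os.environ.get("MLFLOW_TRACKING_URI")) and (
                os.environ.get("MLFLOW_INSPECT_TRACING", "").lower() == "true"
            )
        return self._enabled

    async def on_run_start(self, data: str) -> None:
        self.calls.append(f"run_start:{data}")

    async def on_sample_start(self, data: FakeSampleData) -> None:
        self.calls.append(f"sample_start:{data.sample_id}#{data.epoch}")

    async def on_sample_event(self, data: str) -> None:
        self.calls.append(f"event:{data}")

    async def on_sample_end(self, data: FakeSampleData) -> None:
        self.calls.append(f"sample_end:{data.sample_id}#{data.epoch}")


async def simulate(hooks: RecordingHooks, epochs: int) -> None:
    """Drive the hooks in the documented order for one sample."""
    await hooks.on_run_start("run-1")
    for epoch in range(1, epochs + 1):
        sample = FakeSampleData("q1", epoch)
        # Start and end fire once per epoch.
        await hooks.on_sample_start(sample)
        # Only completed model and tool events are delivered.
        await hooks.on_sample_event("ModelEvent")
        await hooks.on_sample_event("ToolEvent")
        await hooks.on_sample_end(sample)


hooks = RecordingHooks()
asyncio.run(simulate(hooks, epochs=2))
print(hooks.calls)
```

Running the sample for two epochs yields two start/end pairs under a single run start, matching the once-per-epoch behaviour described for on_sample_start and on_sample_end.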