Tracking Hook

Activated when MLFLOW_TRACKING_URI is set.

The tracking hook creates hierarchical MLflow runs that mirror the evaluation structure. It uses the MlflowClient API, which keeps it isolated from the user's own MLflow state, and it is thread-safe for concurrent sample processing.

Features

Parent run per eval invocation with nested child runs per task
Task configuration logged as parameters
Per-sample scores as step metrics
Model token usage (input/output/total per model)
Real-time event counting (model calls, tool calls)
Eval artifacts: per-sample results JSON + full eval log JSON
Additional rich table artifacts under inspect/*.json (tasks, samples, messages, sample scores, events, model usage)
Trace assessments: eval scores logged via mlflow.log_feedback()
Optional provider autolog integration for LLM SDKs
Async logging for reduced hook latency
Thread-safe counters for concurrent samples
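
The thread-safe counting mentioned in the list above can be illustrated with a lock-protected counter. This is a minimal sketch, not the hook's actual implementation: `EventCounter` and `simulate` are hypothetical names, and the event labels are only examples.

```python
import threading
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


class EventCounter:
    """Hypothetical thread-safe counter for event types (not the hook's real class)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._counts = Counter()

    def increment(self, event_type: str, amount: int = 1) -> int:
        # The lock makes the read-modify-write step atomic across threads.
        with self._lock:
            self._counts[event_type] += amount
            return self._counts[event_type]

    def snapshot(self) -> dict:
        # Return a copy so callers never hold a reference to the live mapping.
        with self._lock:
            return dict(self._counts)


def simulate(counter: EventCounter, samples: int = 8, calls_per_sample: int = 100) -> dict:
    """Drive the counter from several threads, one per simulated sample."""

    def run_sample(_sample_index: int) -> None:
        for _ in range(calls_per_sample):
            counter.increment("model_calls")
        counter.increment("tool_calls")

    with ThreadPoolExecutor(max_workers=samples) as pool:
        # Consuming the iterator waits for all workers and re-raises any error.
        list(pool.map(run_sample, range(samples)))
    return counter.snapshot()
```

Calling `simulate(EventCounter())` runs eight simulated samples concurrently; because every update happens under the lock, no increments are lost.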

Configuration

Env var	Required	Default	Description
`MLFLOW_TRACKING_URI`	Yes	–	MLflow server URL
`MLFLOW_EXPERIMENT_NAME`	No	`inspect_ai`	Experiment name
`MLFLOW_INSPECT_LOG_ARTIFACTS`	No	`true`	Log eval artifacts
`INSPECT_MLFLOW_LOG_ARTIFACTS`	No	`true`	Same as `MLFLOW_INSPECT_LOG_ARTIFACTS`, using the newer prefix; takes priority when both are set
`INSPECT_MLFLOW_AUTOLOG_ENABLED`	No	`true`	Enable MLflow provider autolog integrations
`INSPECT_MLFLOW_AUTOLOG_MODELS`	No	`openai,anthropic,langchain,litellm`	CSV or JSON array of providers to autolog
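
To make the table concrete, the sketch below shows one way the two artifact variables and the CSV-or-JSON provider list could be resolved. It is an illustration under stated assumptions, not the hook's own parsing code: the helper names are hypothetical, and both the accepted truthy spellings and the lowercasing of provider names are guesses.

```python
import json
import os
from typing import List, Mapping, Optional

# Assumption: these truthy spellings are a guess, not confirmed by the docs.
TRUE_VALUES = {"1", "true", "yes", "on"}


def resolve_log_artifacts(env: Optional[Mapping[str, str]] = None) -> bool:
    """Resolve the artifact flag; the INSPECT_MLFLOW_ name wins when both are set."""
    env = os.environ if env is None else env
    # Check the newer name first so it takes priority over the older one.
    for name in ("INSPECT_MLFLOW_LOG_ARTIFACTS", "MLFLOW_INSPECT_LOG_ARTIFACTS"):
        raw = env.get(name)
        if raw is not None:
            return raw.strip().lower() in TRUE_VALUES
    return True  # documented default


def parse_autolog_models(raw: str) -> List[str]:
    """Parse a CSV string or a JSON array of provider names."""
    text = raw.strip()
    if text.startswith("["):
        items = json.loads(text)  # JSON array form
    else:
        items = text.split(",")   # CSV form
    # Drop empty entries and normalise whitespace and case (an assumption).
    return [str(item).strip().lower() for item in items if str(item).strip()]
```

Passing a plain dictionary as `env` keeps the helper easy to test without touching the real process environment.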

Supported provider integrations: openai, anthropic, langchain, litellm, mistral, groq, cohere, gemini, bedrock. Providers are enabled only when both the MLflow flavor module and provider SDK are present.
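
The "both modules present" rule can be sketched with the standard library's `importlib.util.find_spec`. The function names here are hypothetical, and which module names the hook actually probes for each provider is an assumption; the usage note below uses `mlflow.openai` and `openai` only as a plausible pair.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if `name` can be located on the import path."""
    try:
        # For dotted names, find_spec imports the parent package in order to
        # search inside it, but it does not import the leaf module itself.
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # A missing parent package raises ModuleNotFoundError rather than
        # returning None, so treat that as "not available".
        return False


def provider_enabled(flavor_module: str, sdk_module: str) -> bool:
    """A provider counts as enabled only when both modules are present."""
    return module_available(flavor_module) and module_available(sdk_module)
```

For example, `provider_enabled("mlflow.openai", "openai")` would return `True` only in an environment where both packages are installed.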

Artifacts

With artifact logging enabled, the tracking hook writes the following artifacts:

inspect/tasks.json
inspect/samples.json
inspect/messages.json
inspect/sample_scores.json
inspect/events.json
inspect/model_usage.json
sample_results/*.json
eval_logs/*.json
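
Once a run's artifacts have been downloaded to a local directory, the `inspect/*.json` tables can be loaded for inspection. The sketch below assumes only that each file is valid JSON; the schema of the individual tables is not documented here, so the values are returned as parsed. The function name is hypothetical.

```python
import json
from pathlib import Path
from typing import Any, Dict


def load_inspect_tables(artifact_dir: str) -> Dict[str, Any]:
    """Load every inspect/*.json file under a local artifact directory.

    Keys are file stems such as "tasks" or "samples"; values are whatever
    JSON each file contains (no schema is assumed).
    """
    tables: Dict[str, Any] = {}
    for path in sorted(Path(artifact_dir, "inspect").glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            tables[path.stem] = json.load(handle)
    return tables
```

If the `inspect` subdirectory is missing, for instance because artifact logging was disabled, the function returns an empty dictionary.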

API Reference

MLflow Tracking hook for Inspect AI.

Logs evaluation runs, task configurations, sample scores, and model usage to an MLflow tracking server. Creates a parent run per eval run with nested child runs per task.

Uses the MlflowClient API to avoid contaminating global mlflow state, so user code that calls mlflow.start_run() independently does not conflict with the hook.
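
The parent and child structure can be built entirely through explicit client calls, which is what keeps the fluent API's active-run stack untouched. The sketch below is an illustration rather than the hook's code: `create_run_hierarchy` is a hypothetical helper, and `client` is any object exposing `create_run(experiment_id, tags=..., run_name=...)` in the style of `mlflow.tracking.MlflowClient`. Nesting relies on the `mlflow.parentRunId` tag.

```python
from typing import Any, Dict, Iterable, Tuple

PARENT_RUN_TAG = "mlflow.parentRunId"  # tag MLflow uses to mark a nested run


def create_run_hierarchy(
    client: Any,
    experiment_id: str,
    eval_name: str,
    task_names: Iterable[str],
) -> Tuple[str, Dict[str, str]]:
    """Create one parent run and one nested child run per task.

    Returns the parent run ID and a mapping of task name to child run ID.
    No global "active run" is set, because only client methods are called.
    """
    parent = client.create_run(experiment_id, run_name=eval_name)
    parent_id = parent.info.run_id

    children: Dict[str, str] = {}
    for task_name in task_names:
        child = client.create_run(
            experiment_id,
            tags={PARENT_RUN_TAG: parent_id},  # links the child to its parent
            run_name=task_name,
        )
        children[task_name] = child.info.run_id
    return parent_id, children
```

With a real tracking server, the client would typically be constructed as `MlflowClient(tracking_uri=...)`, and each run would later be closed with `client.set_terminated(run_id)`.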

Activated automatically when MLFLOW_TRACKING_URI is set.

class inspect_mlflow.tracking.MlflowTrackingHooks

Tracks Inspect AI evaluations in MLflow with hierarchical runs.

Uses MlflowClient API for isolation from user mlflow state.

property artifact_manager: ArtifactManager

property client: MlflowClient

enabled() → bool

Check if the hook should be enabled.

Default implementation returns True.

Hooks may wish to override this to e.g. check the presence of an environment variable or a configuration setting.

Will be called frequently, so consider caching the result if the computation is expensive.
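
An environment-variable check of the kind described above might look like the following. This is a hedged sketch: `EnvGatedHook` is a hypothetical stand-in that does not subclass Inspect's hook base class, and treating an empty or whitespace-only value as unset is an assumption.

```python
import os
from typing import Mapping, Optional


class EnvGatedHook:
    """Hypothetical stand-in for a hook class gated on an environment variable."""

    def __init__(self, env: Optional[Mapping[str, str]] = None) -> None:
        # Accept an explicit mapping so the check can be tested in isolation.
        self._env = os.environ if env is None else env

    def enabled(self) -> bool:
        # A dictionary lookup is cheap, so no caching is needed here.
        # Assumption: an empty or whitespace-only value counts as "not set".
        return bool(self._env.get("MLFLOW_TRACKING_URI", "").strip())
```

Because the check is a single lookup, it remains inexpensive even when called for every hook event.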

async on_model_usage(data: ModelUsageData) → None

Called when a call to a model’s generate() method completes successfully without hitting Inspect’s local cache.

It is not called on a hit in Inspect’s local cache (i.e. when no external API call was made). Provider-side caching still results in this being called.

Parameters: data – Model usage data.
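
Per-model token totals can be accumulated from a callback like this one. The sketch below uses stand-in types rather than Inspect's real `ModelUsageData`, and its callback takes a model name and a usage object directly, which differs from the documented signature; all names are hypothetical.

```python
import asyncio
import threading
from dataclasses import dataclass
from typing import Dict


@dataclass
class Usage:
    """Hypothetical stand-in for a usage record (not Inspect's ModelUsage)."""

    input_tokens: int = 0
    output_tokens: int = 0

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.output_tokens


class UsageTracker:
    """Accumulates input/output token counts per model name."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._by_model: Dict[str, Usage] = {}

    async def on_model_usage(self, model_name: str, usage: Usage) -> None:
        # The lock is held only for a short, non-awaiting section, so it is
        # safe to use a threading lock inside this coroutine.
        with self._lock:
            total = self._by_model.setdefault(model_name, Usage())
            total.input_tokens += usage.input_tokens
            total.output_tokens += usage.output_tokens

    def totals(self) -> Dict[str, Usage]:
        # Return copies so callers cannot mutate the tracker's state.
        with self._lock:
            return {
                name: Usage(u.input_tokens, u.output_tokens)
                for name, u in self._by_model.items()
            }


async def demo() -> Dict[str, Usage]:
    """Feed a few usage records through the tracker concurrently."""
    tracker = UsageTracker()
    await asyncio.gather(
        tracker.on_model_usage("model-a", Usage(10, 5)),
        tracker.on_model_usage("model-a", Usage(3, 2)),
        tracker.on_model_usage("model-b", Usage(7, 1)),
    )
    return tracker.totals()
```

Running `asyncio.run(demo())` returns one aggregated `Usage` per model, from which input, output, and total counts can be logged.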

async on_run_end(data: RunEnd) → None

On run end.

Parameters: data – Run end data.

async on_run_start(data: RunStart) → None

On run start.

A “run” is a single invocation of eval() or eval_retry() which may contain many Tasks, each with many Samples and many epochs. Note that eval_retry() can be invoked multiple times within an eval_set().

Parameters: data – Run start data.

async on_sample_end(data: SampleEnd) → None

On sample end.

Called when a sample has either completed successfully, or when a sample has errored and has no retries remaining.

If a sample is run for multiple epochs, this will be called once per epoch.

Parameters: data – Sample end data.
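
Because this callback fires once per sample per epoch, it is a natural place to turn scores into step metrics. The sketch below shows one plausible conversion and is not the hook's actual mapping: the function names are hypothetical, the letter-grade table assumes the common C/I/P/N convention, and using completion order as the step index is also an assumption.

```python
from typing import List, Optional, Tuple, Union

ScoreValue = Union[str, int, float, bool]

# Assumption: letter grades follow a C/I/P/N convention
# (correct, incorrect, partial, no answer).
LETTER_GRADES = {"C": 1.0, "I": 0.0, "P": 0.5, "N": 0.0}


def score_to_metric(value: ScoreValue) -> Optional[float]:
    """Convert a score value to a float, or None if it has no numeric reading."""
    if isinstance(value, bool):  # check bool before int: bool is an int subclass
        return 1.0 if value else 0.0
    if isinstance(value, (int, float)):
        return float(value)
    if value in LETTER_GRADES:
        return LETTER_GRADES[value]
    try:
        return float(value)  # numeric strings such as "0.75"
    except ValueError:
        return None


def step_metrics(scores: List[ScoreValue]) -> List[Tuple[int, float]]:
    """Pair each convertible score with a step index (here: completion order)."""
    metrics: List[Tuple[int, float]] = []
    for step, value in enumerate(scores):
        metric = score_to_metric(value)
        if metric is not None:
            metrics.append((step, metric))
    return metrics
```

Each resulting `(step, value)` pair could then be passed to a metric-logging call that accepts a `step` argument, so that the scores plot as a series rather than overwriting one another.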

async on_sample_event(data: SampleEvent) → None

On sample event.

Called when a sample event is emitted. Pending events are not logged here (i.e. ToolEvent and ModelEvent are not logged until they are complete).

Parameters: data – Sample event.

async on_task_end(data: TaskEnd) → None

On task end.

Parameters: data – Task end data.

async on_task_start(data: TaskStart) → None

On task start.

Parameters: data – Task start data.

property settings: MLflowSettings