Tracking Hook

Activated when MLFLOW_TRACKING_URI is set.

The tracking hook creates hierarchical MLflow runs that mirror the evaluation structure. It uses the MlflowClient API for full isolation from user MLflow state, and it is thread-safe for concurrent sample processing.

Features

  • Parent run per eval invocation with nested child runs per task

  • Task configuration logged as parameters

  • Per-sample scores as step metrics

  • Model token usage (input/output/total per model)

  • Real-time event counting (model calls, tool calls)

  • Eval artifacts: per-sample results JSON + full eval log JSON

  • Additional rich table artifacts under inspect/*.json (tasks, samples, messages, sample scores, events, model usage)

  • Trace assessments: eval scores logged via mlflow.log_feedback()

  • Optional provider autolog integration for LLM SDKs

  • Async logging for reduced hook latency

  • Thread-safe counters for concurrent samples
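The thread-safe event counting listed above can be illustrated with a minimal sketch. This is not the hook's actual implementation: the class name EventCounter and its methods are hypothetical, and the only assumption is that a lock guards a shared counter while several samples report events concurrently.

```python
import threading
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


class EventCounter:
    """Hypothetical counter of events by kind, safe to call from many threads."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._counts: Counter[str] = Counter()

    def increment(self, kind: str, amount: int = 1) -> None:
        # The lock makes the read-modify-write on the counter atomic.
        with self._lock:
            self._counts[kind] += amount

    def snapshot(self) -> dict[str, int]:
        # Copy under the lock so callers get a consistent view.
        with self._lock:
            return dict(self._counts)


counter = EventCounter()
# Simulate concurrent samples reporting model and tool calls.
with ThreadPoolExecutor(max_workers=8) as pool:
    for _ in range(1000):
        pool.submit(counter.increment, "model_call")
        pool.submit(counter.increment, "tool_call")

totals = counter.snapshot()
```

Taking a snapshot under the same lock lets a reporter read totals while samples are still running.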

Configuration

| Env var | Required | Default | Description |
|---|---|---|---|
| MLFLOW_TRACKING_URI | Yes | | MLflow server URL |
| MLFLOW_EXPERIMENT_NAME | No | inspect_ai | Experiment name |
| MLFLOW_INSPECT_LOG_ARTIFACTS | No | true | Log eval artifacts |
| INSPECT_MLFLOW_LOG_ARTIFACTS | No | true | Same as above (new prefix, takes priority) |
| INSPECT_MLFLOW_AUTOLOG_ENABLED | No | true | Enable MLflow provider autolog integrations |
| INSPECT_MLFLOW_AUTOLOG_MODELS | No | openai,anthropic,langchain,litellm | CSV or JSON array of providers to autolog |
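As a rough sketch of how the variables above might be resolved, the following shows the new-prefix-wins rule for the artifact flag and the CSV-or-JSON parsing for the provider list. The helper names, the set of accepted truthy spellings, and the lowercasing of provider names are assumptions for illustration, not the package's actual parsing rules.

```python
import json
import os
from typing import Mapping, Optional, Sequence

TRUE_VALUES = {"1", "true", "yes", "on"}  # assumed truthy spellings
DEFAULT_AUTOLOG_MODELS = ("openai", "anthropic", "langchain", "litellm")


def first_set(env: Mapping[str, str], names: Sequence[str]) -> Optional[str]:
    """Return the first non-empty value among `names`, or None."""
    for name in names:
        raw = env.get(name, "").strip()
        if raw:
            return raw
    return None


def log_artifacts_enabled(env: Mapping[str, str]) -> bool:
    # The new prefix is checked first, so it wins when both are set.
    raw = first_set(
        env, ["INSPECT_MLFLOW_LOG_ARTIFACTS", "MLFLOW_INSPECT_LOG_ARTIFACTS"]
    )
    return True if raw is None else raw.lower() in TRUE_VALUES


def autolog_models(env: Mapping[str, str]) -> list[str]:
    raw = env.get("INSPECT_MLFLOW_AUTOLOG_MODELS", "").strip()
    if not raw:
        return list(DEFAULT_AUTOLOG_MODELS)
    # A leading "[" is treated as a JSON array; anything else as CSV.
    items = json.loads(raw) if raw.startswith("[") else raw.split(",")
    return [str(item).strip().lower() for item in items if str(item).strip()]


print(log_artifacts_enabled(os.environ), autolog_models(os.environ))
```

Passing the environment as a mapping keeps the helpers easy to exercise with a plain dict.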

Supported provider integrations: openai, anthropic, langchain, litellm, mistral, groq, cohere, gemini, bedrock. Providers are enabled only when both the MLflow flavor module and provider SDK are present.
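The "both modules present" rule can be sketched with the standard library alone. The function names below are hypothetical, and the actual module names the hook probes for each provider are not given here, so the caller supplies them explicitly.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if `name` can be located on the import path.

    find_spec does not import the module itself, although for a dotted
    name it does import the parent package.
    """
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # A dotted name raises ModuleNotFoundError when its parent is missing.
        return False


def provider_enabled(flavor_module: str, sdk_module: str) -> bool:
    # Enable a provider only when both the flavor and the SDK are importable.
    return module_available(flavor_module) and module_available(sdk_module)
```

For example, `provider_enabled("mlflow.openai", "openai")` would return False on a machine where either package is missing, without raising.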

Artifacts

With artifact logging enabled, the tracking hook writes the following artifacts:

  • inspect/tasks.json

  • inspect/samples.json

  • inspect/messages.json

  • inspect/sample_scores.json

  • inspect/events.json

  • inspect/model_usage.json

  • sample_results/*.json

  • eval_logs/*.json

API Reference

MLflow Tracking hook for Inspect AI.

Logs evaluation runs, task configurations, sample scores, and model usage to an MLflow tracking server. Creates a parent run per eval run with nested child runs per task.

Uses MlflowClient API to avoid contaminating global mlflow state, so user code that calls mlflow.start_run() independently will not conflict.
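The parent and child run layout can be sketched through the client-level API. This is an illustration under stated assumptions, not the hook's code: the function log_eval and the shape of its tasks argument are hypothetical, and client is any object that behaves like MlflowClient (create_run, log_param, log_metric, set_terminated). Nesting relies on the mlflow.parentRunId tag, which MLflow uses to link child runs to a parent.

```python
from typing import Any

PARENT_RUN_TAG = "mlflow.parentRunId"  # tag MLflow uses to nest runs


def log_eval(client: Any, experiment_id: str, tasks: dict[str, dict[str, Any]]) -> str:
    """Create one parent run plus one nested child run per task.

    `tasks` maps a task name to {"params": {...}, "scores": [...]},
    a hypothetical shape chosen for this sketch.
    """
    parent = client.create_run(experiment_id, run_name="eval")
    parent_id = parent.info.run_id

    for task_name, task in tasks.items():
        child = client.create_run(
            experiment_id,
            tags={PARENT_RUN_TAG: parent_id},
            run_name=task_name,
        )
        child_id = child.info.run_id

        # Task configuration becomes run parameters.
        for key, value in task["params"].items():
            client.log_param(child_id, key, value)

        # Each sample score is logged as one step of the same metric.
        for step, score in enumerate(task["scores"]):
            client.log_metric(child_id, "score", score, step=step)

        client.set_terminated(child_id)

    client.set_terminated(parent_id)
    return parent_id
```

Because every call names its run ID explicitly, nothing here touches the active-run stack that mlflow.start_run() manages, which is the isolation property described above. With a real server, the client would be an MlflowClient constructed with the tracking URI.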

Activated automatically when MLFLOW_TRACKING_URI is set.

class inspect_mlflow.tracking.MlflowTrackingHooks

Tracks Inspect AI evaluations in MLflow with hierarchical runs.

Uses MlflowClient API for isolation from user mlflow state.

property artifact_manager: ArtifactManager

property client: MlflowClient

enabled() → bool

Check if the hook should be enabled.

Default implementation returns True.

Hooks may wish to override this to e.g. check the presence of an environment variable or a configuration setting.

Will be called frequently, so consider caching the result if the computation is expensive.
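A minimal sketch of the caching advice follows, assuming the check is the presence of MLFLOW_TRACKING_URI. The class EnvGatedHook is a hypothetical stand-in; a real hook would subclass Inspect's hooks base class, which is omitted so the sketch runs on its own.

```python
import os
from typing import Optional


class EnvGatedHook:
    """Hypothetical stand-in for a hook class with a cached enabled() check."""

    def __init__(self) -> None:
        self._enabled: Optional[bool] = None  # None means "not computed yet"

    def enabled(self) -> bool:
        # Compute once and reuse, since enabled() is called frequently.
        if self._enabled is None:
            uri = os.environ.get("MLFLOW_TRACKING_URI", "")
            self._enabled = bool(uri.strip())
        return self._enabled
```

The trade-off is that a cached result will not notice the variable being set or unset after the first call.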

async on_model_usage(data: ModelUsageData) → None

Called when a call to a model’s generate() method completes successfully without hitting Inspect’s local cache.

Note that this is not called when Inspect’s local cache is used and is a cache hit (i.e. if no external API call was made). Provider-side caching will result in this being called.

Parameters:

data – Model usage data.

async on_run_end(data: RunEnd) → None

On run end.

Parameters:

data – Run end data.

async on_run_start(data: RunStart) → None

On run start.

A “run” is a single invocation of eval() or eval_retry() which may contain many Tasks, each with many Samples and many epochs. Note that eval_retry() can be invoked multiple times within an eval_set().

Parameters:

data – Run start data.

async on_sample_end(data: SampleEnd) → None

On sample end.

Called when a sample has either completed successfully, or when a sample has errored and has no retries remaining.

If a sample is run for multiple epochs, this will be called once per epoch.

Parameters:

data – Sample end data.

async on_sample_event(data: SampleEvent) → None

On sample event.

Called when a sample event is emitted. Pending events are not logged here (i.e. ToolEvent and ModelEvent are not logged until they are complete).

Parameters:

data – Sample event.

async on_task_end(data: TaskEnd) → None

On task end.

Parameters:

data – Task end data.

async on_task_start(data: TaskStart) → None

On task start.

Parameters:

data – Task start data.

property settings: MLflowSettings