inspect-mlflow

MLflow integration for Inspect AI. Provides experiment tracking, execution tracing, evaluation comparison, LLM provider autolog, structured artifact tables, trace assessments, and Scout analysis for Inspect AI evaluations.

pip install inspect-mlflow

Set environment variables and run evals as usual. Hooks auto-register via entry points.

export MLFLOW_TRACKING_URI="http://localhost:5000"
export MLFLOW_INSPECT_TRACING="true"
inspect eval my_task.py --model openai/gpt-4o
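
The same setup can also be driven from Python. The following is a minimal sketch, not part of inspect-mlflow: configure_mlflow_env and run_eval are hypothetical helper names, and the sketch assumes that Inspect AI's eval() accepts a task file path plus a model argument and that the hooks read the environment when the eval starts.

```python
import os


def configure_mlflow_env(tracking_uri: str = "http://localhost:5000",
                         tracing: bool = True) -> dict[str, str]:
    """Set the environment variables shown above and return what was set."""
    values = {"MLFLOW_TRACKING_URI": tracking_uri}
    if tracing:
        # Tracing is optional; only set the flag when it is requested.
        values["MLFLOW_INSPECT_TRACING"] = "true"
    os.environ.update(values)
    return values


def run_eval(task_file: str = "my_task.py", model: str = "openai/gpt-4o"):
    """Hypothetical helper: configure MLflow, then run an eval in-process."""
    configure_mlflow_env()
    # Imported lazily so the environment is in place before Inspect AI loads.
    from inspect_ai import eval as inspect_eval

    return inspect_eval(task_file, model=model)
```

Setting the variables before importing Inspect AI is a conservative choice here, since this page does not say when the hooks read them.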

API Reference

MLflow integration for Inspect AI.

Provides experiment tracking, execution tracing, evaluation comparison, and Scout analysis for Inspect AI evaluations via MLflow.

Install and use:

pip install inspect-mlflow

# Set env vars
export MLFLOW_TRACKING_URI="http://localhost:5000"
export MLFLOW_INSPECT_TRACING="true"  # optional, enables tracing

# Run evals as usual. Hooks auto-activate.
inspect eval my_task.py

class inspect_mlflow.ComparisonResult(baseline_log: str, candidate_log: str, baseline_task: str, candidate_task: str, baseline_model: str, candidate_model: str, metrics: list[MetricComparison] = <factory>, samples: list[SampleComparison] = <factory>)

Complete comparison of two evaluation runs.

property aligned_count: int

Number of samples present in both runs.

baseline_log: str

Path to baseline log file.

baseline_model: str

Model name from baseline.

baseline_task: str

Task name from baseline.

candidate_log: str

Path to candidate log file.

candidate_model: str

Model name from candidate.

candidate_task: str

Task name from candidate.

property improvements: list[SampleComparison]

Samples where the candidate scored higher than baseline.

metrics: list[MetricComparison]

Aggregate metric comparisons.

property missing_count: int

Number of samples in baseline but not in candidate.

property new_count: int

Number of samples in candidate but not in baseline.

property regressions: list[SampleComparison]

Samples where the candidate scored lower than baseline.

samples: list[SampleComparison]

Per-sample score comparisons.

summary() → str

Generate a text summary of the comparison.

property unchanged: list[SampleComparison]

Samples with identical scores in both runs.

property win_rate: float | None

Fraction of aligned samples where candidate outperformed baseline.
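
The derived properties above can all be read off the per-sample direction labels. The sketch below uses stand-in classes (Sample and Result are illustrative names, not the library's types) and assumes one entry per sample, that "aligned" means neither 'new' nor 'missing', and that win_rate is None when nothing is aligned. The actual implementation may differ.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Sample:
    """Stand-in for SampleComparison, keeping only the fields used here."""
    id: str
    direction: str  # 'improved', 'regressed', 'unchanged', 'new', 'missing'


@dataclass
class Result:
    """Stand-in for ComparisonResult; not the library's implementation."""
    samples: list = field(default_factory=list)

    def _with(self, direction: str) -> list:
        return [s for s in self.samples if s.direction == direction]

    @property
    def improvements(self) -> list:
        return self._with("improved")

    @property
    def regressions(self) -> list:
        return self._with("regressed")

    @property
    def unchanged(self) -> list:
        return self._with("unchanged")

    @property
    def new_count(self) -> int:
        return len(self._with("new"))

    @property
    def missing_count(self) -> int:
        return len(self._with("missing"))

    @property
    def aligned_count(self) -> int:
        # Aligned samples exist in both runs, so they are neither new nor missing.
        return sum(s.direction not in ("new", "missing") for s in self.samples)

    @property
    def win_rate(self) -> Optional[float]:
        # Assumption: None when nothing is aligned, to avoid dividing by zero.
        if self.aligned_count == 0:
            return None
        return len(self.improvements) / self.aligned_count
```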

class inspect_mlflow.MetricComparison(name: str, scorer: str, baseline_value: float, candidate_value: float, delta: float, relative_delta: float | None, significant: bool, p_value: float | None, ci_lower: float | None, ci_upper: float | None, effect_size: float | None = None)

Comparison of an aggregate metric between two evaluation runs.

baseline_value: float

Metric value in the baseline run.

candidate_value: float

Metric value in the candidate run.

ci_lower: float | None

Lower bound of confidence interval for the difference.

ci_upper: float | None

Upper bound of confidence interval for the difference.

delta: float

Absolute difference (candidate - baseline).

effect_size: float | None = None

Cohen’s d effect size. None if not computed.

name: str

Metric name (e.g., 'accuracy', 'mean').

p_value: float | None

P-value from significance test. None if not computed.

relative_delta: float | None

Relative change as a fraction (delta / baseline). None if baseline is zero.

scorer: str

Scorer that produced this metric.

significant: bool

Whether the difference is statistically significant.
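
As a rough illustration of how the numeric fields relate, the sketch below computes delta and relative_delta as documented above, and Cohen's d using the pooled-standard-deviation definition. The function names are hypothetical, and the choice of pooled (rather than paired) effect size is an assumption; this page does not say which variant the library uses.

```python
import math
import statistics
from typing import Optional, Sequence


def delta_fields(baseline_value: float, candidate_value: float) -> tuple:
    """Return (delta, relative_delta) following the field descriptions."""
    delta = candidate_value - baseline_value
    # relative_delta is undefined when the baseline is zero.
    relative = None if baseline_value == 0 else delta / baseline_value
    return delta, relative


def cohens_d(baseline: Sequence[float],
             candidate: Sequence[float]) -> Optional[float]:
    """Cohen's d with a pooled standard deviation (one common definition)."""
    n1, n2 = len(baseline), len(candidate)
    if n1 < 2 or n2 < 2:
        return None  # sample variance needs at least two points per group
    v1, v2 = statistics.variance(baseline), statistics.variance(candidate)
    pooled = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    if pooled == 0:
        return None  # no spread in either group, so d is undefined
    return (statistics.mean(candidate) - statistics.mean(baseline)) / pooled
```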

class inspect_mlflow.MlflowTracingHooks

MLflow Tracing Hooks.

Creates MLflow trace spans from Inspect AI evaluation events. Each eval run produces a trace with hierarchical spans for tasks, samples, model calls, tool calls, and scoring.
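
To make the span hierarchy concrete, here is a toy tree built with a stand-in Span class. It is not an MLflow object, and the span names and exact nesting are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """Toy span: a name plus child spans (not an MLflow object)."""
    name: str
    children: list = field(default_factory=list)

    def child(self, name: str) -> "Span":
        span = Span(name)
        self.children.append(span)
        return span

    def render(self, depth: int = 0) -> list:
        # Indent each level by two spaces to show the nesting.
        lines = ["  " * depth + self.name]
        for c in self.children:
            lines.extend(c.render(depth + 1))
        return lines


# Build the tree that a single-task, single-sample eval might produce.
run = Span("run")
task = run.child("task: my_task")
sample = task.child("sample 1 (epoch 1)")
sample.child("model call")
sample.child("tool call")
sample.child("scoring")

print("\n".join(run.render()))
```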

enabled() → bool

Check if the hook should be enabled.

Default implementation returns True.

Hooks may wish to override this to e.g. check the presence of an environment variable or a configuration setting.

Will be called frequently, so consider caching the result if the computation is expensive.

async on_run_end(data: RunEnd) → None

On run end.

Parameters:

data – Run end data.

async on_run_start(data: RunStart) → None

On run start.

A “run” is a single invocation of eval() or eval_retry() which may contain many Tasks, each with many Samples and many epochs. Note that eval_retry() can be invoked multiple times within an eval_set().

Parameters:

data – Run start data.

async on_sample_end(data: SampleEnd) → None

On sample end.

Called when a sample has either completed successfully, or when a sample has errored and has no retries remaining.

If a sample is run for multiple epochs, this will be called once per epoch.

Parameters:

data – Sample end data.

async on_sample_event(data: SampleEvent) → None

On sample event.

Called when a sample event is emitted. Pending events are not logged here (i.e. ToolEvent and ModelEvent are not logged until they are complete).

Parameters:

data – Sample event.

async on_sample_start(data: SampleStart) → None

On sample start.

Called when a sample is about to start. If the sample errors and retries, this will not be called again.

If a sample is run for multiple epochs, this will be called once per epoch.

Parameters:

data – Sample start data.

async on_task_end(data: TaskEnd) → None

On task end.

Parameters:

data – Task end data.

async on_task_start(data: TaskStart) → None

On task start.

Parameters:

data – Task start data.

property settings: MLflowSettings

class inspect_mlflow.MlflowTrackingHooks

Tracks Inspect AI evaluations in MLflow with hierarchical runs.

Uses MlflowClient API for isolation from user mlflow state.
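
For readers unfamiliar with nested runs via the client API, the sketch below shows one way to create a child run with MlflowClient.create_run by setting the mlflow.parentRunId tag. The helper names, run names, and the use of the default experiment "0" are assumptions for illustration; this is not necessarily how MlflowTrackingHooks structures its runs.

```python
PARENT_TAG = "mlflow.parentRunId"  # tag MLflow uses to nest one run under another


def create_child_run(client, experiment_id: str, parent_run_id: str, name: str):
    """Create a run nested under parent_run_id.

    `client` is expected to be an mlflow.tracking.MlflowClient, or any object
    with a compatible create_run method.
    """
    return client.create_run(
        experiment_id,
        tags={PARENT_TAG: parent_run_id},
        run_name=name,
    )


def demo(tracking_uri: str = "http://localhost:5000") -> None:
    """Hypothetical walk-through; needs mlflow and a reachable server."""
    from mlflow.tracking import MlflowClient  # lazy import

    client = MlflowClient(tracking_uri=tracking_uri)
    parent = client.create_run("0", run_name="eval run")
    child = create_child_run(client, "0", parent.info.run_id, "task: my_task")
    # Close the child first, then the parent.
    client.set_terminated(child.info.run_id)
    client.set_terminated(parent.info.run_id)
```

Working through an explicit client object, rather than the module-level mlflow.start_run(), leaves any run the user already has active untouched.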

property artifact_manager: ArtifactManager

property client: MlflowClient

enabled() → bool

Check if the hook should be enabled.

Default implementation returns True.

Hooks may wish to override this to e.g. check the presence of an environment variable or a configuration setting.

Will be called frequently, so consider caching the result if the computation is expensive.

async on_model_usage(data: ModelUsageData) → None

Called when a call to a model’s generate() method completes successfully without hitting Inspect’s local cache.

Note that this is not called when Inspect’s local cache is used and is a cache hit (i.e. if no external API call was made). Provider-side caching will result in this being called.

Parameters:

data – Model usage data.
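
The cache rule can be illustrated with a toy model wrapper. CachedModel is a stand-in, not Inspect AI code: the usage callback fires only when the local cache misses, which mirrors the behaviour described above.

```python
import asyncio


class CachedModel:
    """Toy model wrapper with a local cache and a usage callback."""

    def __init__(self, on_usage):
        self._cache: dict[str, str] = {}
        self._on_usage = on_usage

    async def generate(self, prompt: str) -> str:
        if prompt in self._cache:
            # Local cache hit: no external call, so no usage notification.
            return self._cache[prompt]
        output = prompt.upper()  # stands in for a real provider call
        self._cache[prompt] = output
        await self._on_usage({"prompt": prompt, "output_chars": len(output)})
        return output


async def demo() -> int:
    seen = []

    async def record(usage):
        seen.append(usage)

    model = CachedModel(record)
    await model.generate("hello")
    await model.generate("hello")  # served from the local cache
    await model.generate("world")
    return len(seen)


usage_calls = asyncio.run(demo())
print(usage_calls)  # 2: the repeated prompt did not trigger the callback
```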

async on_run_end(data: RunEnd) → None

On run end.

Parameters:

data – Run end data.

async on_run_start(data: RunStart) → None

On run start.

A “run” is a single invocation of eval() or eval_retry() which may contain many Tasks, each with many Samples and many epochs. Note that eval_retry() can be invoked multiple times within an eval_set().

Parameters:

data – Run start data.

async on_sample_end(data: SampleEnd) → None

On sample end.

Called when a sample has either completed successfully, or when a sample has errored and has no retries remaining.

If a sample is run for multiple epochs, this will be called once per epoch.

Parameters:

data – Sample end data.

async on_sample_event(data: SampleEvent) → None

On sample event.

Called when a sample event is emitted. Pending events are not logged here (i.e. ToolEvent and ModelEvent are not logged until they are complete).

Parameters:

data – Sample event.

async on_task_end(data: TaskEnd) → None

On task end.

Parameters:

data – Task end data.

async on_task_start(data: TaskStart) → None

On task start.

Parameters:

data – Task start data.

property settings: MLflowSettings

class inspect_mlflow.SampleComparison(id: int | str, epoch: int, scorer: str, baseline_score: float | None, candidate_score: float | None, delta: float | None, direction: Literal['improved', 'regressed', 'unchanged', 'new', 'missing'])

Comparison of a single sample’s score between two runs.

baseline_score: float | None

Score value in the baseline run. None if sample missing from baseline.

candidate_score: float | None

Score value in the candidate run. None if sample missing from candidate.

delta: float | None

Score difference (candidate - baseline). None if either score is missing.

direction: Literal['improved', 'regressed', 'unchanged', 'new', 'missing']

Classification of the score change between runs.

epoch: int

Epoch number.

id: int | str

Sample ID.

scorer: str

Scorer that produced this score.
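
The relationship between the two scores, delta, and direction can be sketched as follows. classify and align are hypothetical helpers, samples are keyed by (id, epoch), and treating a delta exactly equal to the threshold as 'unchanged' is an assumption about the boundary case.

```python
from typing import Optional


def classify(baseline: Optional[float], candidate: Optional[float],
             threshold: float = 0.0) -> tuple:
    """Return (delta, direction) for one sample, mirroring the fields above."""
    if baseline is None:
        return None, "new"       # sample only exists in the candidate run
    if candidate is None:
        return None, "missing"   # sample only exists in the baseline run
    delta = candidate - baseline
    if abs(delta) <= threshold:
        return delta, "unchanged"
    return delta, ("improved" if delta > 0 else "regressed")


def align(baseline: dict, candidate: dict, threshold: float = 0.0) -> dict:
    """Align two {(id, epoch): score} mappings and classify every key."""
    keys = sorted(set(baseline) | set(candidate), key=str)
    return {
        k: classify(baseline.get(k), candidate.get(k), threshold) for k in keys
    }
```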

inspect_mlflow.compare_evals(baseline: str | Path | EvalLog, candidate: str | Path | EvalLog, scorers: list[str] | None = None, significance: float = 0.05, regression_threshold: float = 0.0, sample_filter: Callable[[EvalSample], bool] | None = None) → ComparisonResult

Compare results from two evaluation runs.

Loads both logs, aligns samples by (id, epoch), computes score deltas and aggregate metric differences, and runs significance tests on the differences.

Parameters:
  • baseline – Path to baseline eval log, or an EvalLog object.

  • candidate – Path to candidate eval log, or an EvalLog object.

  • scorers – Specific scorer names to compare. None compares all.

  • significance – P-value threshold below which a difference is reported as significant. Default 0.05.

  • regression_threshold – Minimum absolute delta to count as regression or improvement. Deltas within this threshold are classified as unchanged. Default 0.0 (any difference counts).

  • sample_filter – Optional function to filter samples before comparison. Only samples where filter returns True are included.

Returns:

ComparisonResult with metrics, sample comparisons, and regressions.
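
A possible usage pattern is sketched below. describe and main are hypothetical helpers that rely only on the attributes and parameters documented on this page. The log paths passed to main and the 0.05 regression threshold are placeholder choices.

```python
def describe(result) -> list:
    """Summarise an object exposing the documented ComparisonResult attributes."""
    lines = [f"{result.baseline_model} -> {result.candidate_model}"]
    for m in result.metrics:
        flag = "significant" if m.significant else "not significant"
        lines.append(f"{m.scorer}/{m.name}: {m.delta:+.3f} ({flag})")
    for s in result.regressions:
        lines.append(f"regressed: sample {s.id} epoch {s.epoch} ({s.delta:+.3f})")
    return lines


def main(baseline_log: str, candidate_log: str) -> None:
    # Lazy import so describe() can be used without the package installed.
    from inspect_mlflow import compare_evals

    result = compare_evals(
        baseline_log,
        candidate_log,
        regression_threshold=0.05,  # ignore score changes of 0.05 or less
    )
    print("\n".join(describe(result)))
```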

async inspect_mlflow.import_mlflow_traces(experiment_name: str | None = None, tracking_uri: str | None = None, limit: int | None = None) → AsyncIterator[Transcript]

Import MLflow traces as Scout transcripts.

Parameters:
  • experiment_name – MLflow experiment name to import from. Defaults to MLFLOW_EXPERIMENT_NAME env var or "inspect_ai".

  • tracking_uri – MLflow tracking server URI. Defaults to MLFLOW_TRACKING_URI env var.

  • limit – Maximum number of traces to import. None for all.

Yields:

Transcript objects ready for Scout database insertion.
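
Because the function yields transcripts asynchronously, callers consume it with async for. The sketch below shows one way to collect results. take and main are hypothetical helpers, and the sketch assumes that import_mlflow_traces is an async generator function, so calling it returns an iterator directly without await.

```python
import asyncio
from typing import AsyncIterator, Optional


async def take(source: AsyncIterator, limit: Optional[int] = None) -> list:
    """Collect items from any async iterator, stopping after `limit` items."""
    items: list = []
    if limit is not None and limit <= 0:
        return items
    async for item in source:
        items.append(item)
        if limit is not None and len(items) >= limit:
            break
    return items


async def main() -> None:
    # Lazy import: requires inspect-mlflow and a reachable tracking server.
    from inspect_mlflow import import_mlflow_traces

    transcripts = await take(
        import_mlflow_traces(experiment_name="inspect_ai", limit=10)
    )
    print(f"imported {len(transcripts)} transcripts")
```

With a tracking server available, the entry point would be asyncio.run(main()).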