inspect-mlflow¶

MLflow integration for Inspect AI. Provides experiment tracking, execution tracing, evaluation comparison, LLM provider autolog, structured artifact tables, trace assessments, and Scout analysis for Inspect AI evaluations.

pip install inspect-mlflow

Set environment variables and run evals as usual. Hooks auto-register via entry points.

export MLFLOW_TRACKING_URI="http://localhost:5000"
export MLFLOW_INSPECT_TRACING="true"
inspect eval my_task.py --model openai/gpt-4o

API Reference¶

MLflow integration for Inspect AI.

Provides experiment tracking, execution tracing, evaluation comparison, and Scout analysis for Inspect AI evaluations via MLflow.

Install and use:

pip install inspect-mlflow

# Set env vars
export MLFLOW_TRACKING_URI="http://localhost:5000"
export MLFLOW_INSPECT_TRACING="true"  # optional, enables tracing

# Run evals as usual. Hooks auto-activate.
inspect eval my_task.py

class inspect_mlflow.ComparisonResult(baseline_log: str, candidate_log: str, baseline_task: str, candidate_task: str, baseline_model: str, candidate_model: str, metrics: list[MetricComparison] = <factory>, samples: list[SampleComparison] = <factory>)¶

Complete comparison of two evaluation runs.

property aligned_count: int¶: Number of samples present in both runs.

baseline_log: str¶: Path to baseline log file.

baseline_model: str¶: Model name from baseline.

baseline_task: str¶: Task name from baseline.

candidate_log: str¶: Path to candidate log file.

candidate_model: str¶: Model name from candidate.

candidate_task: str¶: Task name from candidate.

property improvements: list[SampleComparison]¶: Samples where the candidate scored higher than baseline.

metrics: list[MetricComparison]¶: Aggregate metric comparisons.

property missing_count: int¶: Samples in baseline but not in candidate.

property new_count: int¶: Samples in candidate but not in baseline.

property regressions: list[SampleComparison]¶: Samples where the candidate scored lower than baseline.

samples: list[SampleComparison]¶: Per-sample score comparisons.

summary() → str¶: Generate a text summary of the comparison.

property unchanged: list[SampleComparison]¶: Samples with identical scores in both runs.

property win_rate: float | None¶: Fraction of aligned samples where candidate outperformed baseline.
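The relationship between the per-sample directions and win_rate can be illustrated with a small standalone sketch. It does not use the library's classes: the win_rate helper and the ALIGNED set below are hypothetical stand-ins, and returning None when no samples are aligned is an assumption inferred from the float | None return type.

```python
from typing import List, Optional

# Directions that imply the sample exists in both runs (assumption based on
# the SampleComparison.direction values documented on this page).
ALIGNED = {"improved", "regressed", "unchanged"}


def win_rate(directions: List[str]) -> Optional[float]:
    """Fraction of aligned samples whose direction is 'improved'."""
    aligned = [d for d in directions if d in ALIGNED]
    if not aligned:
        # Assumption: with no aligned samples the rate is undefined.
        return None
    return sum(d == "improved" for d in aligned) / len(aligned)


# Four aligned samples (two improved) plus one sample only in the candidate.
directions = ["improved", "regressed", "unchanged", "improved", "new"]
print(win_rate(directions))
```

Samples marked new or missing are left out of the denominator here because the property is described as a fraction of aligned samples.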

class inspect_mlflow.MetricComparison(name: str, scorer: str, baseline_value: float, candidate_value: float, delta: float, relative_delta: float | None, significant: bool, p_value: float | None, ci_lower: float | None, ci_upper: float | None, effect_size: float | None = None)¶

Comparison of an aggregate metric between two evaluation runs.

baseline_value: float¶: Metric value in the baseline run.

candidate_value: float¶: Metric value in the candidate run.

ci_lower: float | None¶: Lower bound of confidence interval for the difference.

ci_upper: float | None¶: Upper bound of confidence interval for the difference.

delta: float¶: Absolute difference (candidate - baseline).

effect_size: float | None = None¶: Cohen’s d effect size. None if not computed.

name: str¶: Metric name (e.g., ‘accuracy’, ‘mean’).

p_value: float | None¶: P-value from significance test. None if not computed.

relative_delta: float | None¶: Relative change as a fraction (delta / baseline). None if baseline is zero.

scorer: str¶: Scorer that produced this metric.

significant: bool¶: Whether the difference is statistically significant.
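As a rough illustration of how the delta and relative_delta fields relate, the following sketch follows the field descriptions above. The helper name metric_deltas is hypothetical and is not part of inspect-mlflow.

```python
from typing import Optional, Tuple


def metric_deltas(
    baseline_value: float, candidate_value: float
) -> Tuple[float, Optional[float]]:
    """Return (delta, relative_delta) as described for MetricComparison."""
    # Absolute difference: candidate - baseline.
    delta = candidate_value - baseline_value
    # Relative change as a fraction of the baseline; undefined at zero.
    relative_delta = None if baseline_value == 0 else delta / baseline_value
    return delta, relative_delta


print(metric_deltas(0.5, 0.75))
print(metric_deltas(0.0, 0.3))
```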

class inspect_mlflow.MlflowTracingHooks¶

MLflow Tracing Hooks.

Creates MLflow trace spans from Inspect AI evaluation events. Each eval run produces a trace with hierarchical spans for tasks, samples, model calls, tool calls, and scoring.

enabled() → bool¶

Check if the hook should be enabled.

Default implementation returns True.

Hooks may wish to override this to e.g. check the presence of an environment variable or a configuration setting.

Will be called frequently, so consider caching the result if the computation is expensive.
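Because enabled() may be called often, an environment-variable check can be cached. The sketch below is one possible way to do that, assuming the check reads MLFLOW_INSPECT_TRACING and treats the string "true" as enabled. The function tracing_enabled is hypothetical, and the actual check in inspect-mlflow may differ.

```python
import os
from functools import lru_cache


@lru_cache(maxsize=1)
def tracing_enabled() -> bool:
    """Read MLFLOW_INSPECT_TRACING once and cache the result."""
    value = os.environ.get("MLFLOW_INSPECT_TRACING", "")
    # Assumption: only the (case-insensitive) string "true" enables tracing.
    return value.strip().lower() == "true"


os.environ["MLFLOW_INSPECT_TRACING"] = "true"
tracing_enabled.cache_clear()  # drop any earlier cached value
print(tracing_enabled())
```

One consequence of caching is that later changes to the variable are ignored until cache_clear() is called, which is usually acceptable for a setting read at process start.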

async on_run_end(data: RunEnd) → None¶

On run end.

Parameters:: data – Run end data.

async on_run_start(data: RunStart) → None¶

On run start.

A “run” is a single invocation of eval() or eval_retry() which may contain many Tasks, each with many Samples and many epochs. Note that eval_retry() can be invoked multiple times within an eval_set().

Parameters:: data – Run start data.

async on_sample_end(data: SampleEnd) → None¶

On sample end.

Called when a sample has either completed successfully, or when a sample has errored and has no retries remaining.

If a sample is run for multiple epochs, this will be called once per epoch.

Parameters:: data – Sample end data.

async on_sample_event(data: SampleEvent) → None¶

On sample event.

Called when a sample event is emitted. Pending events are not logged here (i.e. ToolEvent and ModelEvent are not logged until they are complete).

Parameters:: data – Sample event.

async on_sample_start(data: SampleStart) → None¶

On sample start.

Called when a sample is about to start. If the sample errors and retries, this will not be called again.

If a sample is run for multiple epochs, this will be called once per epoch.

Parameters:: data – Sample start data.

async on_task_end(data: TaskEnd) → None¶

On task end.

Parameters:: data – Task end data.

async on_task_start(data: TaskStart) → None¶

On task start.

Parameters:: data – Task start data.

property settings: MLflowSettings¶

class inspect_mlflow.MlflowTrackingHooks¶

Tracks Inspect AI evaluations in MLflow with hierarchical runs.

Uses the MlflowClient API to stay isolated from the user's own MLflow state.

property artifact_manager: ArtifactManager¶

property client: MlflowClient¶

enabled() → bool¶

Check if the hook should be enabled.

Default implementation returns True.

Hooks may wish to override this to e.g. check the presence of an environment variable or a configuration setting.

Will be called frequently, so consider caching the result if the computation is expensive.

async on_model_usage(data: ModelUsageData) → None¶

Called when a call to a model’s generate() method completes successfully without hitting Inspect’s local cache.

Note that this is not called when Inspect’s local cache is used and is a cache hit (i.e. if no external API call was made). Provider-side caching will result in this being called.

Parameters:: data – Model usage data.

async on_run_end(data: RunEnd) → None¶

On run end.

Parameters:: data – Run end data.

async on_run_start(data: RunStart) → None¶

On run start.

Parameters:: data – Run start data.

async on_sample_end(data: SampleEnd) → None¶

On sample end.

Called when a sample has either completed successfully, or when a sample has errored and has no retries remaining.

If a sample is run for multiple epochs, this will be called once per epoch.

Parameters:: data – Sample end data.

async on_sample_event(data: SampleEvent) → None¶

On sample event.

Called when a sample event is emitted. Pending events are not logged here (i.e. ToolEvent and ModelEvent are not logged until they are complete).

Parameters:: data – Sample event.

async on_task_end(data: TaskEnd) → None¶

On task end.

Parameters:: data – Task end data.

async on_task_start(data: TaskStart) → None¶

On task start.

Parameters:: data – Task start data.

property settings: MLflowSettings¶

class inspect_mlflow.SampleComparison(id: int | str, epoch: int, scorer: str, baseline_score: float | None, candidate_score: float | None, delta: float | None, direction: Literal['improved', 'regressed', 'unchanged', 'new', 'missing'])¶

Comparison of a single sample’s score between two runs.

baseline_score: float | None¶: Score value in the baseline run. None if sample missing from baseline.

candidate_score: float | None¶: Score value in the candidate run. None if sample missing from candidate.

delta: float | None¶: Score difference (candidate - baseline). None if either score is missing.

direction: Literal['improved', 'regressed', 'unchanged', 'new', 'missing']¶: Classification of the score change between runs.

epoch: int¶: Epoch number.

id: int | str¶: Sample ID.

scorer: str¶: Scorer that produced this score.

Compare results from two evaluation runs.

Loads both logs, aligns samples by (id, epoch), computes score deltas and aggregate metric differences, and runs significance tests on the differences.

Parameters:

baseline – Path to baseline eval log, or an EvalLog object.
candidate – Path to candidate eval log, or an EvalLog object.
scorers – Specific scorer names to compare. None compares all.
significance – P-value threshold for significance tests.
regression_threshold – Minimum absolute delta to count as regression or improvement. Deltas within this threshold are classified as unchanged. Default 0.0 (any difference counts).
sample_filter – Optional function to filter samples before comparison. Only samples where filter returns True are included.

Returns:

ComparisonResult with metrics, sample comparisons, and regressions.
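The alignment and classification steps described above can be sketched without the library. The code below is a minimal illustration and not the package's implementation: classify and align_and_classify are hypothetical helpers, scores are plain floats keyed by (id, epoch), and treating a delta exactly equal to the threshold as unchanged is an assumption drawn from the regression_threshold description.

```python
from typing import Dict, Optional, Tuple, Union

SampleKey = Tuple[Union[int, str], int]  # (sample id, epoch)


def classify(
    baseline: Optional[float],
    candidate: Optional[float],
    regression_threshold: float = 0.0,
) -> str:
    """Map a pair of scores to one of the documented direction labels."""
    if baseline is None:
        return "new"  # sample only in the candidate run
    if candidate is None:
        return "missing"  # sample only in the baseline run
    delta = candidate - baseline
    if abs(delta) <= regression_threshold:
        return "unchanged"  # within the threshold counts as no change
    return "improved" if delta > 0 else "regressed"


def align_and_classify(
    baseline_scores: Dict[SampleKey, float],
    candidate_scores: Dict[SampleKey, float],
    regression_threshold: float = 0.0,
) -> Dict[SampleKey, str]:
    """Align samples by (id, epoch) and classify each score change."""
    keys = set(baseline_scores) | set(candidate_scores)
    return {
        key: classify(
            baseline_scores.get(key),
            candidate_scores.get(key),
            regression_threshold,
        )
        for key in keys
    }


baseline = {(1, 1): 1.0, (2, 1): 0.0, (3, 1): 0.5}
candidate = {(1, 1): 0.0, (2, 1): 1.0, (4, 1): 1.0}
for key, direction in align_and_classify(baseline, candidate).items():
    print(key, direction)
```

Raising regression_threshold above 0.0 moves small score changes into the unchanged bucket, which can reduce noise when scores are continuous rather than 0/1.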

async inspect_mlflow.import_mlflow_traces(experiment_name: str | None = None, tracking_uri: str | None = None, limit: int | None = None) → AsyncIterator[Transcript]¶

Import MLflow traces as Scout transcripts.

Parameters:

experiment_name – MLflow experiment name to import from. Defaults to the MLFLOW_EXPERIMENT_NAME env var or "inspect_ai".
tracking_uri – MLflow tracking server URI. Defaults to the MLFLOW_TRACKING_URI env var.
limit – Maximum number of traces to import. None for all.

Yields:

Transcript objects ready for Scout database insertion.
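Since the function is documented as returning an AsyncIterator, it is consumed with async for rather than a plain loop. The sketch below shows that pattern with a stand-in generator so it runs without an MLflow server. fake_traces and collect are hypothetical names, and the yielded strings are placeholders, not real Transcript objects.

```python
import asyncio
from typing import AsyncIterator, List, Optional, TypeVar

T = TypeVar("T")


async def collect(source: AsyncIterator[T]) -> List[T]:
    """Drain an async iterator into a list."""
    return [item async for item in source]


async def fake_traces(limit: Optional[int] = None) -> AsyncIterator[str]:
    """Hypothetical stand-in for import_mlflow_traces (placeholder data)."""
    names = ["trace-a", "trace-b", "trace-c"]
    # A limit of None means "yield everything", mirroring the docs above.
    for name in names[:limit]:
        yield name


print(asyncio.run(collect(fake_traces(limit=2))))
```

With the real function, fake_traces(limit=2) would presumably be replaced by a call such as import_mlflow_traces(experiment_name=..., limit=...), using the parameters listed in the signature above.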