API Reference¶
MLflow integration for Inspect AI.
Provides experiment tracking, execution tracing, evaluation comparison, and Scout analysis for Inspect AI evaluations via MLflow.
Install and use:
pip install inspect-mlflow
# Set env vars
export MLFLOW_TRACKING_URI="http://localhost:5000"
export MLFLOW_INSPECT_TRACING="true"  # optional, enables tracing
# Run evals as usual. Hooks auto-activate.
inspect eval my_task.py
- class inspect_mlflow.ComparisonResult(baseline_log: str, candidate_log: str, baseline_task: str, candidate_task: str, baseline_model: str, candidate_model: str, metrics: list[MetricComparison] = <factory>, samples: list[SampleComparison] = <factory>)¶
Complete comparison of two evaluation runs.
- property improvements: list[SampleComparison]¶
Samples where the candidate scored higher than baseline.
- metrics: list[MetricComparison]¶
Aggregate metric comparisons.
- property regressions: list[SampleComparison]¶
Samples where the candidate scored lower than baseline.
- samples: list[SampleComparison]¶
Per-sample score comparisons.
- property unchanged: list[SampleComparison]¶
Samples with identical scores in both runs.
- class inspect_mlflow.MetricComparison(name: str, scorer: str, baseline_value: float, candidate_value: float, delta: float, relative_delta: float | None, significant: bool, p_value: float | None, ci_lower: float | None, ci_upper: float | None, effect_size: float | None = None)¶
Comparison of an aggregate metric between two evaluation runs.
- class inspect_mlflow.MlflowTracingHooks¶
MLflow Tracing Hooks.
Creates MLflow trace spans from Inspect AI evaluation events. Each eval run produces a trace with hierarchical spans for tasks, samples, model calls, tool calls, and scoring.
- enabled() bool¶
Check if the hook should be enabled.
Default implementation returns True.
Hooks may wish to override this to e.g. check the presence of an environment variable or a configuration setting.
Will be called frequently, so consider caching the result if the computation is expensive.
- async on_run_start(data: RunStart) None¶
On run start.
A “run” is a single invocation of eval() or eval_retry() which may contain many Tasks, each with many Samples and many epochs. Note that eval_retry() can be invoked multiple times within an eval_set().
- Parameters:
data – Run start data.
- async on_sample_end(data: SampleEnd) None¶
On sample end.
Called when a sample has either completed successfully, or when a sample has errored and has no retries remaining.
If a sample is run for multiple epochs, this will be called once per epoch.
- Parameters:
data – Sample end data.
- async on_sample_event(data: SampleEvent) None¶
On sample event.
Called when a sample event is emitted. Pending events are not logged here (i.e. ToolEvent and ModelEvent are not logged until they are complete).
- Parameters:
data – Sample event.
- async on_sample_start(data: SampleStart) None¶
On sample start.
Called when a sample is about to start. If the sample errors and retries, this will not be called again.
If a sample is run for multiple epochs, this will be called once per epoch.
- Parameters:
data – Sample start data.
- property settings: MLflowSettings¶
- class inspect_mlflow.MlflowTrackingHooks¶
Tracks Inspect AI evaluations in MLflow with hierarchical runs.
Uses MlflowClient API for isolation from user mlflow state.
- property artifact_manager: ArtifactManager¶
- property client: MlflowClient¶
- enabled() bool¶
Check if the hook should be enabled.
Default implementation returns True.
Hooks may wish to override this to e.g. check the presence of an environment variable or a configuration setting.
Will be called frequently, so consider caching the result if the computation is expensive.
- async on_model_usage(data: ModelUsageData) None¶
Called when a call to a model’s generate() method completes successfully without hitting Inspect’s local cache.
Note that this is not called when Inspect’s local cache is used and is a cache hit (i.e. if no external API call was made). Provider-side caching will result in this being called.
- Parameters:
data – Model usage data.
- async on_run_start(data: RunStart) None¶
On run start.
A “run” is a single invocation of eval() or eval_retry() which may contain many Tasks, each with many Samples and many epochs. Note that eval_retry() can be invoked multiple times within an eval_set().
- Parameters:
data – Run start data.
- async on_sample_end(data: SampleEnd) None¶
On sample end.
Called when a sample has either completed successfully, or when a sample has errored and has no retries remaining.
If a sample is run for multiple epochs, this will be called once per epoch.
- Parameters:
data – Sample end data.
- async on_sample_event(data: SampleEvent) None¶
On sample event.
Called when a sample event is emitted. Pending events are not logged here (i.e. ToolEvent and ModelEvent are not logged until they are complete).
- Parameters:
data – Sample event.
- property settings: MLflowSettings¶
- class inspect_mlflow.SampleComparison(id: int | str, epoch: int, scorer: str, baseline_score: float | None, candidate_score: float | None, delta: float | None, direction: Literal['improved', 'regressed', 'unchanged', 'new', 'missing'])¶
Comparison of a single sample’s score between two runs.
- baseline_score: float | None¶
Score value in the baseline run. None if sample missing from baseline.
- candidate_score: float | None¶
Score value in the candidate run. None if sample missing from candidate.
- inspect_mlflow.compare_evals(baseline: str | Path | EvalLog, candidate: str | Path | EvalLog, scorers: list[str] | None = None, significance: float = 0.05, regression_threshold: float = 0.0, sample_filter: Callable[[EvalSample], bool] | None = None) ComparisonResult¶
Compare results from two evaluation runs.
Loads both logs, aligns samples by (id, epoch), computes score deltas and aggregate metric differences, and runs significance tests on the differences.
- Parameters:
baseline – Path to baseline eval log, or an EvalLog object.
candidate – Path to candidate eval log, or an EvalLog object.
scorers – Specific scorer names to compare. None compares all.
significance – P-value threshold for significance tests.
regression_threshold – Minimum absolute delta to count as regression or improvement. Deltas within this threshold are classified as unchanged. Default 0.0 (any difference counts).
sample_filter – Optional function to filter samples before comparison. Only samples where filter returns True are included.
- Returns:
ComparisonResult with metrics, sample comparisons, and regressions.
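The regression_threshold rule above can be illustrated with a small sketch. It is not the library's implementation: the helper name classify_direction is hypothetical, the inclusive boundary (a delta exactly equal to the threshold counts as unchanged) is an assumption, and so is the reading of 'new' as "absent from baseline" and 'missing' as "absent from candidate".

```python
from typing import Literal, Optional

Direction = Literal["improved", "regressed", "unchanged", "new", "missing"]


def classify_direction(
    baseline_score: Optional[float],
    candidate_score: Optional[float],
    regression_threshold: float = 0.0,
) -> Direction:
    """Hypothetical helper: label one aligned sample pair."""
    # Assumption: a sample absent from the baseline is "new",
    # and one absent from the candidate is "missing".
    if baseline_score is None:
        return "new"
    if candidate_score is None:
        return "missing"
    delta = candidate_score - baseline_score
    # Assumption: deltas at or below the threshold are "unchanged".
    if abs(delta) <= regression_threshold:
        return "unchanged"
    return "improved" if delta > 0 else "regressed"
```

With the default threshold of 0.0 any nonzero delta is reported; raising it to, say, 0.05 suppresses small score changes.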
- async inspect_mlflow.import_mlflow_traces(experiment_name: str | None = None, tracking_uri: str | None = None, limit: int | None = None) AsyncIterator[Transcript]¶
Import MLflow traces as Scout transcripts.
- Parameters:
experiment_name – MLflow experiment name to import from. Defaults to MLFLOW_EXPERIMENT_NAME env var or “inspect_ai”.
tracking_uri – MLflow tracking server URI. Defaults to MLFLOW_TRACKING_URI env var.
limit – Maximum number of traces to import. None for all.
- Yields:
Transcript objects ready for Scout database insertion.
MLflow Tracking hook for Inspect AI.
Logs evaluation runs, task configurations, sample scores, and model usage to an MLflow tracking server. Creates a parent run per eval run with nested child runs per task.
Uses MlflowClient API to avoid contaminating global mlflow state, so user code that calls mlflow.start_run() independently will not conflict.
Activated automatically when MLFLOW_TRACKING_URI is set.
- class inspect_mlflow.tracking.MlflowTrackingHooks¶
Tracks Inspect AI evaluations in MLflow with hierarchical runs.
Uses MlflowClient API for isolation from user mlflow state.
- property artifact_manager: ArtifactManager¶
- property client: MlflowClient¶
- enabled() bool¶
Check if the hook should be enabled.
Default implementation returns True.
Hooks may wish to override this to e.g. check the presence of an environment variable or a configuration setting.
Will be called frequently, so consider caching the result if the computation is expensive.
- async on_model_usage(data: ModelUsageData) None¶
Called when a call to a model’s generate() method completes successfully without hitting Inspect’s local cache.
Note that this is not called when Inspect’s local cache is used and is a cache hit (i.e. if no external API call was made). Provider-side caching will result in this being called.
- Parameters:
data – Model usage data.
- async on_run_start(data: RunStart) None¶
On run start.
A “run” is a single invocation of eval() or eval_retry() which may contain many Tasks, each with many Samples and many epochs. Note that eval_retry() can be invoked multiple times within an eval_set().
- Parameters:
data – Run start data.
- async on_sample_end(data: SampleEnd) None¶
On sample end.
Called when a sample has either completed successfully, or when a sample has errored and has no retries remaining.
If a sample is run for multiple epochs, this will be called once per epoch.
- Parameters:
data – Sample end data.
- async on_sample_event(data: SampleEvent) None¶
On sample event.
Called when a sample event is emitted. Pending events are not logged here (i.e. ToolEvent and ModelEvent are not logged until they are complete).
- Parameters:
data – Sample event.
- property settings: MLflowSettings¶
MLflow Tracing hook for Inspect AI.
Maps evaluation execution flow to MLflow trace spans, giving users the MLflow trace UI for debugging why a particular sample scored the way it did.
Creates a span tree mirroring the eval hierarchy:
- eval_run (root)
  - task: math_reasoning (CHAIN)
    - sample: q1 (CHAIN)
      - model_call: gpt-4o (LLM) - 847 tokens, 1.2s
      - tool_call: calculator (TOOL) - args: {"expr": "2+2"}, result: 4
      - score: accuracy (EVALUATOR) - value: C
Activated automatically when both MLFLOW_TRACKING_URI and MLFLOW_INSPECT_TRACING="true" are set.
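The activation rule for tracing can be sketched as a plain function over the environment. This is only an illustration: the function name tracing_enabled is hypothetical, and treating the flag case-insensitively is an assumption, since the text only states that the value "true" enables tracing.

```python
import os
from typing import Mapping


def tracing_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Hypothetical check mirroring the documented activation rule."""
    # A tracking server must be configured.
    has_uri = bool(env.get("MLFLOW_TRACKING_URI"))
    # Assumption: "TRUE" or " true " are accepted as well as "true".
    flag = env.get("MLFLOW_INSPECT_TRACING", "").strip().lower() == "true"
    return has_uri and flag
```

Passing the environment as an argument keeps the check easy to test without mutating os.environ.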
- class inspect_mlflow.tracing.MlflowTracingHooks¶
MLflow Tracing Hooks.
Creates MLflow trace spans from Inspect AI evaluation events. Each eval run produces a trace with hierarchical spans for tasks, samples, model calls, tool calls, and scoring.
- enabled() bool¶
Check if the hook should be enabled.
Default implementation returns True.
Hooks may wish to override this to e.g. check the presence of an environment variable or a configuration setting.
Will be called frequently, so consider caching the result if the computation is expensive.
- async on_run_start(data: RunStart) None¶
On run start.
A “run” is a single invocation of eval() or eval_retry() which may contain many Tasks, each with many Samples and many epochs. Note that eval_retry() can be invoked multiple times within an eval_set().
- Parameters:
data – Run start data.
- async on_sample_end(data: SampleEnd) None¶
On sample end.
Called when a sample has either completed successfully, or when a sample has errored and has no retries remaining.
If a sample is run for multiple epochs, this will be called once per epoch.
- Parameters:
data – Sample end data.
- async on_sample_event(data: SampleEvent) None¶
On sample event.
Called when a sample event is emitted. Pending events are not logged here (i.e. ToolEvent and ModelEvent are not logged until they are complete).
- Parameters:
data – Sample event.
- async on_sample_start(data: SampleStart) None¶
On sample start.
Called when a sample is about to start. If the sample errors and retries, this will not be called again.
If a sample is run for multiple epochs, this will be called once per epoch.
- Parameters:
data – Sample start data.
- property settings: MLflowSettings¶
Evaluation comparison and regression detection.
Compare results from two Inspect AI evaluation runs to detect score regressions, compute statistical significance, and generate reports.
Example usage:
from inspect_mlflow.comparison import compare_evals
result = compare_evals("logs/baseline.eval", "logs/candidate.eval")
print(result.summary())
for r in result.regressions:
    print(f"Sample {r.id} regressed: {r.baseline_score} -> {r.candidate_score}")
- class inspect_mlflow.comparison.ComparisonResult(baseline_log: str, candidate_log: str, baseline_task: str, candidate_task: str, baseline_model: str, candidate_model: str, metrics: list[MetricComparison] = <factory>, samples: list[SampleComparison] = <factory>)¶
Complete comparison of two evaluation runs.
- property improvements: list[SampleComparison]¶
Samples where the candidate scored higher than baseline.
- metrics: list[MetricComparison]¶
Aggregate metric comparisons.
- property regressions: list[SampleComparison]¶
Samples where the candidate scored lower than baseline.
- samples: list[SampleComparison]¶
Per-sample score comparisons.
- property unchanged: list[SampleComparison]¶
Samples with identical scores in both runs.
- class inspect_mlflow.comparison.MetricComparison(name: str, scorer: str, baseline_value: float, candidate_value: float, delta: float, relative_delta: float | None, significant: bool, p_value: float | None, ci_lower: float | None, ci_upper: float | None, effect_size: float | None = None)¶
Comparison of an aggregate metric between two evaluation runs.
- class inspect_mlflow.comparison.SampleComparison(id: int | str, epoch: int, scorer: str, baseline_score: float | None, candidate_score: float | None, delta: float | None, direction: Literal['improved', 'regressed', 'unchanged', 'new', 'missing'])¶
Comparison of a single sample’s score between two runs.
- baseline_score: float | None¶
Score value in the baseline run. None if sample missing from baseline.
- candidate_score: float | None¶
Score value in the candidate run. None if sample missing from candidate.
- inspect_mlflow.comparison.cohens_d(baseline_scores: list[float], candidate_scores: list[float]) float | None¶
Compute Cohen’s d effect size for paired samples.
Measures the practical significance of the difference between two sets of scores, independent of sample size. Values around 0.2 are small, 0.5 medium, and 0.8 large.
- Parameters:
baseline_scores – Per-sample scores from baseline.
candidate_scores – Per-sample scores from candidate.
- Returns:
Cohen’s d value, or None if fewer than 2 samples.
- inspect_mlflow.comparison.compare_evals(baseline: str | Path | EvalLog, candidate: str | Path | EvalLog, scorers: list[str] | None = None, significance: float = 0.05, regression_threshold: float = 0.0, sample_filter: Callable[[EvalSample], bool] | None = None) ComparisonResult¶
Compare results from two evaluation runs.
Loads both logs, aligns samples by (id, epoch), computes score deltas and aggregate metric differences, and runs significance tests on the differences.
- Parameters:
baseline – Path to baseline eval log, or an EvalLog object.
candidate – Path to candidate eval log, or an EvalLog object.
scorers – Specific scorer names to compare. None compares all.
significance – P-value threshold for significance tests.
regression_threshold – Minimum absolute delta to count as regression or improvement. Deltas within this threshold are classified as unchanged. Default 0.0 (any difference counts).
sample_filter – Optional function to filter samples before comparison. Only samples where filter returns True are included.
- Returns:
ComparisonResult with metrics, sample comparisons, and regressions.
Core comparison logic for evaluation runs.
Loads two eval logs, aligns samples, computes score deltas, runs significance tests, and returns a structured ComparisonResult.
- inspect_mlflow.comparison._compare.compare_evals(baseline: str | Path | EvalLog, candidate: str | Path | EvalLog, scorers: list[str] | None = None, significance: float = 0.05, regression_threshold: float = 0.0, sample_filter: Callable[[EvalSample], bool] | None = None) ComparisonResult¶
Compare results from two evaluation runs.
Loads both logs, aligns samples by (id, epoch), computes score deltas and aggregate metric differences, and runs significance tests on the differences.
- Parameters:
baseline – Path to baseline eval log, or an EvalLog object.
candidate – Path to candidate eval log, or an EvalLog object.
scorers – Specific scorer names to compare. None compares all.
significance – P-value threshold for significance tests.
regression_threshold – Minimum absolute delta to count as regression or improvement. Deltas within this threshold are classified as unchanged. Default 0.0 (any difference counts).
sample_filter – Optional function to filter samples before comparison. Only samples where filter returns True are included.
- Returns:
ComparisonResult with metrics, sample comparisons, and regressions.
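The alignment step, pairing samples by (id, epoch) before computing deltas, might look roughly like the sketch below. The function align_scores and its dict-based inputs are hypothetical stand-ins for illustration; the real function works on EvalLog objects, and how it orders the aligned samples is not documented here.

```python
from typing import Dict, List, Optional, Tuple, Union

Key = Tuple[Union[int, str], int]  # (sample id, epoch)
Row = Tuple[Key, Optional[float], Optional[float], Optional[float]]


def align_scores(
    baseline: Dict[Key, float], candidate: Dict[Key, float]
) -> List[Row]:
    """Hypothetical helper: join two score maps on (id, epoch)."""
    rows: List[Row] = []
    # Take the union of keys so samples present in only one run are kept.
    # Assumption: sort by stringified id, then epoch, for stable output.
    keys = sorted(set(baseline) | set(candidate), key=lambda k: (str(k[0]), k[1]))
    for key in keys:
        b = baseline.get(key)
        c = candidate.get(key)
        # A delta is only defined when both runs scored the sample.
        delta = c - b if b is not None and c is not None else None
        rows.append((key, b, c, delta))
    return rows
```

Samples that appear in only one run keep a None score and a None delta, which matches the `float | None` fields on SampleComparison.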
Data types for evaluation comparison results.
- class inspect_mlflow.comparison._types.ComparisonResult(baseline_log: str, candidate_log: str, baseline_task: str, candidate_task: str, baseline_model: str, candidate_model: str, metrics: list[MetricComparison] = <factory>, samples: list[SampleComparison] = <factory>)¶
Complete comparison of two evaluation runs.
- property improvements: list[SampleComparison]¶
Samples where the candidate scored higher than baseline.
- metrics: list[MetricComparison]¶
Aggregate metric comparisons.
- property regressions: list[SampleComparison]¶
Samples where the candidate scored lower than baseline.
- samples: list[SampleComparison]¶
Per-sample score comparisons.
- property unchanged: list[SampleComparison]¶
Samples with identical scores in both runs.
- class inspect_mlflow.comparison._types.MetricComparison(name: str, scorer: str, baseline_value: float, candidate_value: float, delta: float, relative_delta: float | None, significant: bool, p_value: float | None, ci_lower: float | None, ci_upper: float | None, effect_size: float | None = None)¶
Comparison of an aggregate metric between two evaluation runs.
- class inspect_mlflow.comparison._types.SampleComparison(id: int | str, epoch: int, scorer: str, baseline_score: float | None, candidate_score: float | None, delta: float | None, direction: Literal['improved', 'regressed', 'unchanged', 'new', 'missing'])¶
Comparison of a single sample’s score between two runs.
- baseline_score: float | None¶
Score value in the baseline run. None if sample missing from baseline.
- candidate_score: float | None¶
Score value in the candidate run. None if sample missing from candidate.
Statistical tests for evaluation comparison.
Provides bootstrap confidence intervals, McNemar’s test for paired binary outcomes, permutation tests, and effect size. Uses NumPy only (no scipy dependency).
- class inspect_mlflow.comparison._statistics.SignificanceResult(significant: bool, p_value: float, ci_lower: float, ci_upper: float, method: str)¶
Result of a statistical significance test.
- inspect_mlflow.comparison._statistics.bootstrap_ci(baseline_scores: list[float], candidate_scores: list[float], significance: float = 0.05, n_resamples: int = 10000, seed: int | None = 42) SignificanceResult¶
Compute bootstrap confidence interval for the difference in means.
Uses the shifted bootstrap (centered under H0) for proper two-sided p-value computation. Vectorized in batches for large datasets.
- Parameters:
baseline_scores – Per-sample scores from baseline run.
candidate_scores – Per-sample scores from candidate run.
significance – Significance level (default 0.05 for 95% CI).
n_resamples – Number of bootstrap resamples.
seed – Random seed for reproducibility.
- Returns:
SignificanceResult with CI bounds and p-value.
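As a rough illustration of a shifted bootstrap on paired differences, the sketch below uses only the standard library rather than the vectorized NumPy batches described above. The name bootstrap_diff_sketch, the tuple return value, the percentile CI, and the add-one p-value correction are all assumptions made for the example, not the library's code.

```python
import random
from statistics import mean
from typing import List, Optional, Tuple


def bootstrap_diff_sketch(
    baseline_scores: List[float],
    candidate_scores: List[float],
    significance: float = 0.05,
    n_resamples: int = 2000,
    seed: Optional[int] = 42,
) -> Tuple[float, float, float, float]:
    """Return (observed mean difference, ci_lower, ci_upper, p_value)."""
    rng = random.Random(seed)
    diffs = [c - b for b, c in zip(baseline_scores, candidate_scores)]
    n = len(diffs)
    observed = mean(diffs)
    # Resample the paired differences with replacement.
    boot_means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    # Percentile confidence interval from the resampled means.
    lo_idx = int((significance / 2) * (n_resamples - 1))
    hi_idx = int((1 - significance / 2) * (n_resamples - 1))
    # Shift the bootstrap distribution so it is centered on zero (H0),
    # then count resamples at least as extreme as the observed difference.
    extreme = sum(1 for m in boot_means if abs(m - observed) >= abs(observed))
    p_value = (extreme + 1) / (n_resamples + 1)
    return observed, boot_means[lo_idx], boot_means[hi_idx], p_value
```

Fixing the seed makes repeated comparisons of the same logs reproducible, which is presumably why the documented default is 42 rather than None.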
- inspect_mlflow.comparison._statistics.cohens_d(baseline_scores: list[float], candidate_scores: list[float]) float | None¶
Compute Cohen’s d effect size for paired samples.
Measures the practical significance of the difference between two sets of scores, independent of sample size. Values around 0.2 are small, 0.5 medium, and 0.8 large.
- Parameters:
baseline_scores – Per-sample scores from baseline.
candidate_scores – Per-sample scores from candidate.
- Returns:
Cohen’s d value, or None if fewer than 2 samples.
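One common paired-sample definition of Cohen's d divides the mean of the per-sample differences by their sample standard deviation. Whether the library uses this variant or a pooled standard deviation is not stated here, so the sketch below is an assumption; the name cohens_d_paired and the None result for zero variance are hypothetical choices.

```python
from statistics import mean, stdev
from typing import List, Optional


def cohens_d_paired(
    baseline_scores: List[float], candidate_scores: List[float]
) -> Optional[float]:
    """Hypothetical paired Cohen's d: mean(diff) / stdev(diff)."""
    if len(baseline_scores) != len(candidate_scores):
        raise ValueError("paired inputs must have the same length")
    # Mirrors the documented behavior: undefined for fewer than 2 samples.
    if len(baseline_scores) < 2:
        return None
    diffs = [c - b for b, c in zip(baseline_scores, candidate_scores)]
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    # Assumption: report None when every difference is identical.
    if sd == 0:
        return None
    return mean(diffs) / sd
```

For example, differences of [1, 0, 1, 0] have mean 0.5 and standard deviation of about 0.577, giving d of about 0.87, a large effect by the thresholds quoted above.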
- inspect_mlflow.comparison._statistics.mcnemars_test(baseline_correct: list[bool], candidate_correct: list[bool], significance: float = 0.05) SignificanceResult¶
McNemar’s test for paired binary outcomes.
Tests whether the rate of discordant pairs (one correct, one incorrect) differs significantly between runs. Uses the chi-square approximation with continuity correction.
- Parameters:
baseline_correct – Per-sample correctness from baseline (True/False).
candidate_correct – Per-sample correctness from candidate (True/False).
significance – Significance level.
- Returns:
SignificanceResult with p-value.
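The chi-square approximation with continuity correction can be written without scipy, because the survival function of a chi-square distribution with one degree of freedom equals erfc(sqrt(x / 2)). The sketch below is illustrative only: the name mcnemar_sketch, the tuple return value, and the handling of the no-discordant-pairs case are assumptions.

```python
import math
from typing import List, Tuple


def mcnemar_sketch(
    baseline_correct: List[bool],
    candidate_correct: List[bool],
    significance: float = 0.05,
) -> Tuple[float, float, bool]:
    """Return (statistic, p_value, significant) for paired binary outcomes."""
    pairs = list(zip(baseline_correct, candidate_correct))
    # Discordant pairs: correct in exactly one of the two runs.
    b = sum(1 for base, cand in pairs if base and not cand)
    c = sum(1 for base, cand in pairs if cand and not base)
    # Assumption: with no discordant pairs there is no evidence of change.
    if b + c == 0:
        return 0.0, 1.0, False
    # Continuity-corrected statistic, clamped at zero.
    statistic = max(abs(b - c) - 1, 0) ** 2 / (b + c)
    # Chi-square (1 degree of freedom) upper tail probability.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value, p_value < significance
```

Only the discordant pairs enter the statistic; samples that both runs got right or both got wrong carry no information about which run is better.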
- inspect_mlflow.comparison._statistics.permutation_test(baseline_scores: list[float], candidate_scores: list[float], significance: float = 0.05, n_iterations: int = 10000, seed: int | None = 42) SignificanceResult¶
Two-sided permutation test for paired samples.
Randomly swaps baseline/candidate labels and computes the mean difference under the null hypothesis of no difference. Vectorized in batches for large datasets.
- Parameters:
baseline_scores – Per-sample scores from baseline.
candidate_scores – Per-sample scores from candidate.
significance – Significance level.
n_iterations – Number of permutation iterations.
seed – Random seed for reproducibility.
- Returns:
SignificanceResult with p-value.
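For paired samples, swapping the baseline/candidate labels of a pair is equivalent to flipping the sign of that pair's difference. The loop below sketches this with the standard library instead of vectorized NumPy batches; the name paired_permutation_p and the add-one p-value correction are assumptions for the example.

```python
import random
from typing import List, Optional


def paired_permutation_p(
    baseline_scores: List[float],
    candidate_scores: List[float],
    n_iterations: int = 2000,
    seed: Optional[int] = 42,
) -> float:
    """Hypothetical two-sided sign-flip permutation test; returns a p-value."""
    rng = random.Random(seed)
    diffs = [c - b for b, c in zip(baseline_scores, candidate_scores)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    extreme = 0
    for _ in range(n_iterations):
        # Swapping labels within a pair negates its difference.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / n) >= observed:
            extreme += 1
    # Add-one correction keeps the p-value strictly positive.
    return (extreme + 1) / (n_iterations + 1)
```

When the two runs score identically, every permutation is as extreme as the observation and the p-value is 1.0.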
Scout import source for MLflow traces.
Imports MLflow traces into an Inspect Scout transcript database, enabling Scout scanners to analyze any MLflow-traced LLM application.
Usage:
from inspect_mlflow.scout import import_mlflow_traces
from inspect_scout import transcripts_db

async with transcripts_db("./my-transcripts") as db:
    await db.insert(import_mlflow_traces(
        experiment_name="inspect-mlflow-demo",
        tracking_uri="http://localhost:5000",
    ))
- async inspect_mlflow.scout.import_mlflow_traces(experiment_name: str | None = None, tracking_uri: str | None = None, limit: int | None = None) AsyncIterator[Transcript]¶
Import MLflow traces as Scout transcripts.
- Parameters:
experiment_name – MLflow experiment name to import from. Defaults to MLFLOW_EXPERIMENT_NAME env var or “inspect_ai”.
tracking_uri – MLflow tracking server URI. Defaults to MLFLOW_TRACKING_URI env var.
limit – Maximum number of traces to import. None for all.
- Yields:
Transcript objects ready for Scout database insertion.
- inspect_mlflow.scout.traces_request_time(trace: Any) str | None¶
Extract request time from trace info.
Autolog utilities for MLflow hook integration.
- inspect_mlflow._autolog.enable_autolog(models: list[str], *, find_spec: ~collections.abc.Callable[[str], ~typing.Any] = <function find_spec>, import_module: ~collections.abc.Callable[[str], ~typing.Any] = <function import_module>) bool¶
Enable MLflow autolog for selected model providers.
Returns True if at least one provider was enabled.
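The find_spec and import_module parameters suggest that module lookup is injectable, which makes the function testable without the provider packages installed. The sketch below guesses at that pattern and is not the library's code: the name enable_autolog_sketch, the treatment of each entry in models as a provider name, and the mapping to an mlflow.<provider> flavor module with an autolog() function are all assumptions.

```python
import importlib
import importlib.util
from typing import Any, Callable, List


def enable_autolog_sketch(
    models: List[str],
    *,
    find_spec: Callable[[str], Any] = importlib.util.find_spec,
    import_module: Callable[[str], Any] = importlib.import_module,
) -> bool:
    """Hypothetical: call autolog() on each available mlflow flavor module."""
    enabled_any = False
    for provider in models:
        # Assumption: each entry names an MLflow flavor, e.g. "openai".
        module_name = f"mlflow.{provider}"
        try:
            # find_spec raises if the parent package itself is missing.
            if find_spec(module_name) is None:
                continue
            import_module(module_name).autolog()
        except Exception:
            # A provider that cannot be enabled is skipped, not fatal.
            continue
        enabled_any = True
    return enabled_any
```

In a test, both callables can be replaced with fakes so that no real import takes place.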
Artifact table extraction and shaping helpers for MLflow tracking.
- inspect_mlflow.artifacts.tables.extract_event_rows(*, eval_id: str, task_name: str, sample_id: Any, sample: Any) list[dict[str, Any]]¶
- inspect_mlflow.artifacts.tables.extract_inspect_table_rows(*, eval_id: str, task_name: str, log: Any) dict[str, list[dict[str, Any]]]¶
Build inspect table rows from eval log content.
- inspect_mlflow.artifacts.tables.extract_message_rows(*, eval_id: str, task_name: str, sample_id: Any, sample: Any) list[dict[str, Any]]¶
- inspect_mlflow.artifacts.tables.extract_model_usage_rows(*, eval_id: str, task_name: str, sample_id: Any, sample: Any) list[dict[str, Any]]¶
- inspect_mlflow.artifacts.tables.extract_sample_score_rows(*, eval_id: str, task_name: str, sample_id: Any, scores: Any) list[dict[str, Any]]¶
Artifact manager for MLflow tracking hook.
- class inspect_mlflow.artifacts.manager.ArtifactManager(client: MlflowClient, logger: Logger | None = None)¶
Handle Inspect artifact extraction and MLflow artifact logging.
Configuration for inspect-mlflow hooks.
Uses pydantic-settings when available for typed, validated config with the INSPECT_MLFLOW_ prefix. Falls back to os.getenv() when pydantic-settings is not installed.
- class inspect_mlflow.config.MLflowSettings(tracking_uri: str | None = None, experiment_name: str = 'inspect_ai', tracing_enabled: bool = False, log_artifacts: bool = True, autolog_enabled: bool = True, autolog_models: list[str] = <factory>)¶
Fallback settings using os.getenv() when pydantic-settings is not installed.
- inspect_mlflow.config.load_settings() MLflowSettings¶
Shared utilities for MLflow hooks.
- inspect_mlflow.util.safe_log_params(mlflow: Any, params: dict[str, Any]) None¶
Log params, truncating values that exceed MLflow’s 500-char limit.
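The truncation step can be sketched independently of MLflow. The helper below is hypothetical: it only prepares the values, the name truncate_params is invented for the example, and ending truncated values with "..." is an assumption about how truncation is marked.

```python
from typing import Any, Dict

MAX_PARAM_LENGTH = 500  # limit quoted in the description above


def truncate_params(
    params: Dict[str, Any], max_length: int = MAX_PARAM_LENGTH
) -> Dict[str, str]:
    """Hypothetical helper: stringify values and cap their length."""
    truncated: Dict[str, str] = {}
    for key, value in params.items():
        text = str(value)
        if len(text) > max_length:
            # Assumption: reserve three characters for an ellipsis marker.
            text = text[: max_length - 3] + "..."
        truncated[key] = text
    return truncated
```

Truncating before logging avoids a rejected request for oversized values such as long system prompts.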
- inspect_mlflow.util.score_to_numeric(value: Any) float | None¶
Convert a Score value to a numeric value for MLflow metrics.
Handles Inspect AI score conventions:
- int/float: returned as-is
- bool: True -> 1.0, False -> 0.0
- str: "C"/"correct" -> 1.0, "I"/"incorrect" -> 0.0, "P"/"partial" -> 0.5
- other: None (metric skipped)
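A minimal sketch of these conversion rules follows. It is not the library's implementation: the name score_to_numeric_sketch is hypothetical, string matching is assumed to be case-insensitive, and numeric inputs are converted to float rather than returned unchanged.

```python
from typing import Any, Optional

# String conventions listed above, keyed in lowercase.
_STRING_SCORES = {
    "c": 1.0, "correct": 1.0,
    "i": 0.0, "incorrect": 0.0,
    "p": 0.5, "partial": 0.5,
}


def score_to_numeric_sketch(value: Any) -> Optional[float]:
    """Hypothetical conversion of a score value to a float metric."""
    # bool must be tested before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return 1.0 if value else 0.0
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        # Assumption: surrounding whitespace and letter case are ignored.
        return _STRING_SCORES.get(value.strip().lower())
    # Anything else (lists, dicts, None) has no numeric form.
    return None
```

Returning None for unrecognized values lets the caller skip the metric instead of logging a misleading number.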