Evaluation Comparison

Compare results from two Inspect AI evaluation runs to detect score regressions, compute statistical significance, and generate comparison reports.

Quick Start

from inspect_mlflow.comparison import compare_evals

result = compare_evals("logs/baseline.eval", "logs/candidate.eval")
print(result.summary())

for r in result.regressions:
    print(f"Sample {r.id}: {r.baseline_score} -> {r.candidate_score}")

Output:

Baseline:  openai/gpt-4o-mini (math_task)
Candidate: openai/gpt-4o-mini (math_task)
Samples:   5 aligned, 0 missing, 0 new

  Metric            Baseline  Candidate             Delta        Sig.
  -------------------------------------------------------------------
  match/accuracy      0.6000     0.4000   -0.2000 (-33.3%)  p=0.048*
  Effect size (match/accuracy): Cohen's d = -0.73 (medium effect)

Regressions: 2, Improvements: 1, Unchanged: 2
Candidate won on 1 of 5 samples (20.0%)

Features

  • Sample alignment by (id, epoch) key with string/int ID normalization

  • Automatic test selection: McNemar’s test for binary scores (0/1), bootstrap confidence interval for continuous scores

  • Effect size: Cohen’s d computed independently of sample size

  • Regression threshold: filter noise with regression_threshold=0.05

  • Sample filtering: sample_filter=lambda s: s.id in subset

  • Win rate tracking across aligned samples

  • No scipy dependency: all statistics implemented with NumPy only
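The alignment step in the first bullet can be pictured with a minimal sketch. It assumes that ID normalization means comparing IDs as strings, so the integer 7 and the string "7" refer to the same sample; the helper names normalize_key and align are hypothetical and are not part of inspect_mlflow.

```python
from typing import Optional, Union

SampleId = Union[int, str]
Key = tuple[str, int]


def normalize_key(sample_id: SampleId, epoch: int) -> Key:
    """Build an alignment key; str() makes the int 7 and the string "7" match."""
    return (str(sample_id), epoch)


def align(
    baseline: dict[tuple[SampleId, int], float],
    candidate: dict[tuple[SampleId, int], float],
) -> dict[Key, tuple[Optional[float], Optional[float]]]:
    """Pair scores by normalized (id, epoch); None marks the absent side."""
    base = {normalize_key(i, e): s for (i, e), s in baseline.items()}
    cand = {normalize_key(i, e): s for (i, e), s in candidate.items()}
    keys = sorted(base.keys() | cand.keys())
    return {k: (base.get(k), cand.get(k)) for k in keys}


# Hypothetical scores keyed by (id, epoch); baseline uses int IDs, candidate strings.
baseline = {(1, 1): 1.0, (2, 1): 0.0, (3, 1): 1.0}
candidate = {("1", 1): 1.0, ("2", 1): 1.0, ("4", 1): 0.0}

pairs = align(baseline, candidate)
aligned = [k for k, (b, c) in pairs.items() if b is not None and c is not None]
missing = [k for k, (b, c) in pairs.items() if c is None]  # in baseline only
new = [k for k, (b, c) in pairs.items() if b is None]  # in candidate only
print(len(aligned), len(missing), len(new))
```

In this sketch, samples 1 and 2 align despite the differing ID types, sample 3 counts as missing, and sample 4 counts as new.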

Parameters

  Parameter             Default     Description
  -------------------------------------------------------------------------------------
  baseline              (required)  Path to baseline eval log or EvalLog object
  candidate             (required)  Path to candidate eval log or EvalLog object
  scorers               None        Scorer names to compare. None compares all common scorers
  significance          0.05        P-value threshold for significance tests
  regression_threshold  0.0         Minimum delta to count as regression or improvement
  sample_filter         None        Function to filter samples before comparison

Statistical Tests

The comparison module selects the appropriate test based on score distribution:

Binary scores (all values are 0.0 or 1.0): McNemar’s test with continuity correction. Tests whether discordant pairs (one run correct, other incorrect) are asymmetrically distributed.
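As an illustration of the binary case, here is a minimal standard-library sketch of McNemar's test with continuity correction. It is not the library's code: the function name mcnemar_test is hypothetical, and returning p = 1.0 when there are no discordant pairs is an assumption made for this example.

```python
import math


def mcnemar_test(baseline: list[float], candidate: list[float]) -> tuple[float, float]:
    """Return (chi2, p_value) for paired 0/1 scores, with continuity correction."""
    if len(baseline) != len(candidate):
        raise ValueError("score lists must be paired and equal in length")
    # Discordant pairs: exactly one of the two runs got the sample right.
    b = sum(1 for x, y in zip(baseline, candidate) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(baseline, candidate) if x == 0 and y == 1)
    if b + c == 0:
        return 0.0, 1.0  # assumption: no discordant pairs means no evidence
    # Continuity correction, clamped at zero so that b == c gives chi2 == 0.
    chi2 = max(abs(b - c) - 1, 0) ** 2 / (b + c)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical data: 10 samples regressed, 2 improved, 8 correct in both runs.
baseline_scores = [1.0] * 10 + [0.0] * 2 + [1.0] * 8
candidate_scores = [0.0] * 10 + [1.0] * 2 + [1.0] * 8
chi2, p_value = mcnemar_test(baseline_scores, candidate_scores)
print(f"chi2={chi2:.3f} p={p_value:.3f}")
```

Only the 12 discordant pairs influence the statistic; the 8 samples that both runs answered correctly carry no information about which run is better.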

Continuous scores: Shifted bootstrap confidence interval with 10,000 resamples. Computes a two-sided p-value under the null hypothesis of no difference.
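One common reading of a shifted bootstrap is sketched below, using only the standard library rather than NumPy. The paired differences are re-centered on zero to impose the null hypothesis, then resampled to see how often a mean at least as extreme as the observed one appears. The function name, the seed parameter, and the add-one correction are assumptions of this sketch, not documented behavior.

```python
import random
from statistics import fmean


def shifted_bootstrap_p(
    baseline: list[float],
    candidate: list[float],
    n_resamples: int = 10_000,
    seed: int = 0,
) -> float:
    """Two-sided p-value for the mean paired difference under H0: no difference."""
    if not baseline or len(baseline) != len(candidate):
        raise ValueError("score lists must be non-empty and equal in length")
    diffs = [c - b for b, c in zip(baseline, candidate)]
    observed = fmean(diffs)
    # Shift the differences so their mean is zero (up to rounding): this imposes H0.
    shifted = [d - observed for d in diffs]
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_resamples):
        resample = rng.choices(shifted, k=len(shifted))
        if abs(fmean(resample)) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimate away from an exact zero.
    return (extreme + 1) / (n_resamples + 1)


# Hypothetical continuous scores with a consistent gain of about 0.5.
base = [0.2, 0.3, 0.1, 0.25, 0.35, 0.2, 0.3, 0.1]
cand = [0.7, 0.9, 0.5, 0.8, 0.8, 0.7, 0.9, 0.5]
p_gain = shifted_bootstrap_p(base, cand)
p_same = shifted_bootstrap_p(base, base)
print(p_gain, p_same)
```

A fixed seed makes the p-value reproducible between runs, which matters when the result gates a regression check.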

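Effect size: Cohen’s d is always computed for primary metrics. Absolute values around 0.2 are conventionally read as small, 0.5 as medium, and 0.8 as large; a negative value means the candidate scored lower than the baseline.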
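The source does not state which Cohen's d formula the library uses. The sketch below shows one common paired-sample variant (the mean of the paired differences divided by their sample standard deviation); a pooled-standard-deviation variant would give different numbers. The names cohens_d_paired and effect_label, the label cut-offs, and the None return for zero variance are assumptions for illustration.

```python
from statistics import fmean, stdev
from typing import Optional


def cohens_d_paired(baseline: list[float], candidate: list[float]) -> Optional[float]:
    """Paired-sample effect size: mean difference over the SD of the differences."""
    diffs = [c - b for b, c in zip(baseline, candidate)]
    if len(diffs) < 2:
        return None  # mirrors the documented "None if fewer than 2 samples"
    sd = stdev(diffs)
    if sd == 0:
        return None  # assumption: undefined when every difference is identical
    return fmean(diffs) / sd


def effect_label(d: float) -> str:
    """Map |d| onto conventional labels (cut-offs assumed at 0.2, 0.5, 0.8)."""
    size = abs(d)
    if size < 0.2:
        return "negligible"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"


# Hypothetical binary scores: two regressions, one improvement, two unchanged.
d = cohens_d_paired([1.0, 1.0, 0.0, 1.0, 0.0], [0.0, 0.0, 1.0, 1.0, 0.0])
print(d, effect_label(d))
```

Because d divides by a standard deviation rather than a standard error, it does not grow as more samples are added, which is what makes it a useful complement to the p-value.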

Result Objects

ComparisonResult provides these properties:

  • metrics: aggregate metric comparisons with significance results

  • samples: per-sample score comparisons with direction classification

  • regressions: samples where candidate scored lower

  • improvements: samples where candidate scored higher

  • unchanged: samples whose scores are identical or differ by no more than regression_threshold

  • aligned_count, missing_count, new_count: alignment counts

  • win_rate: fraction of aligned samples where candidate won

  • summary(): formatted text report

API Reference

inspect_mlflow.comparison.compare_evals(baseline: str | Path | EvalLog, candidate: str | Path | EvalLog, scorers: list[str] | None = None, significance: float = 0.05, regression_threshold: float = 0.0, sample_filter: Callable[[EvalSample], bool] | None = None) -> ComparisonResult

Compare results from two evaluation runs.

Loads both logs, aligns samples by (id, epoch), computes score deltas and aggregate metric differences, and runs significance tests on the differences.

Parameters:
  • baseline – Path to baseline eval log, or an EvalLog object.

  • candidate – Path to candidate eval log, or an EvalLog object.

  • scorers – Specific scorer names to compare. None compares all.

  • significance – P-value threshold for significance tests.

  • regression_threshold – Minimum absolute delta to count as regression or improvement. Deltas within this threshold are classified as unchanged. Default 0.0 (any difference counts).

  • sample_filter – Optional function to filter samples before comparison. Only samples where filter returns True are included.

Returns:

ComparisonResult with metrics, sample comparisons, and regressions.
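The effect of regression_threshold on per-sample classification can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the function name classify is hypothetical, and treating a delta exactly equal to the threshold as unchanged is a guess based on the wording "within this threshold". The five return values match the direction literals of SampleComparison.

```python
from typing import Optional


def classify(
    baseline_score: Optional[float],
    candidate_score: Optional[float],
    regression_threshold: float = 0.0,
) -> str:
    """Return one of: improved, regressed, unchanged, new, missing."""
    if baseline_score is None:
        return "new"  # sample exists only in the candidate run
    if candidate_score is None:
        return "missing"  # sample exists only in the baseline run
    delta = candidate_score - baseline_score
    if abs(delta) <= regression_threshold:
        return "unchanged"  # assumption: the boundary value counts as unchanged
    return "improved" if delta > 0 else "regressed"


print(classify(0.60, 0.62))  # default threshold: any difference counts
print(classify(0.60, 0.62, regression_threshold=0.05))  # small delta is noise
print(classify(1.0, 0.0, regression_threshold=0.05))
```

With the default threshold of 0.0, only exactly equal scores are unchanged; raising it to 0.05 absorbs small fluctuations, which is mainly useful for continuous scorers.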

inspect_mlflow.comparison._statistics.cohens_d(baseline_scores: list[float], candidate_scores: list[float]) -> float | None

Compute Cohen’s d effect size for paired samples.

Measures the practical significance of the difference between two sets of scores, independent of sample size. Absolute values around 0.2 are conventionally read as small, 0.5 as medium, and 0.8 as large.

Parameters:
  • baseline_scores – Per-sample scores from baseline.

  • candidate_scores – Per-sample scores from candidate.

Returns:

Cohen’s d value, or None if fewer than 2 samples.

class inspect_mlflow.comparison.ComparisonResult(baseline_log: str, candidate_log: str, baseline_task: str, candidate_task: str, baseline_model: str, candidate_model: str, metrics: list[MetricComparison] = <factory>, samples: list[SampleComparison] = <factory>)

Complete comparison of two evaluation runs.

property aligned_count: int

Number of samples present in both runs.

baseline_log: str

Path to baseline log file.

baseline_model: str

Model name from baseline.

baseline_task: str

Task name from baseline.

candidate_log: str

Path to candidate log file.

candidate_model: str

Model name from candidate.

candidate_task: str

Task name from candidate.

property improvements: list[SampleComparison]

Samples where the candidate scored higher than baseline.

metrics: list[MetricComparison]

Aggregate metric comparisons.

property missing_count: int

Samples in baseline but not in candidate.

property new_count: int

Samples in candidate but not in baseline.

property regressions: list[SampleComparison]

Samples where the candidate scored lower than baseline.

samples: list[SampleComparison]

Per-sample score comparisons.

summary() -> str

Generate a text summary of the comparison.

property unchanged: list[SampleComparison]

Samples whose scores are identical in both runs or differ by no more than regression_threshold.

property win_rate: float | None

Fraction of aligned samples where candidate outperformed baseline.
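A minimal sketch of how a win rate of this kind can be derived from per-sample directions, assuming that only improved samples count as wins and that new or missing samples are excluded from the denominator. The helper name win_rate_from_directions is hypothetical.

```python
from typing import Optional

ALIGNED_DIRECTIONS = ("improved", "regressed", "unchanged")


def win_rate_from_directions(directions: list[str]) -> Optional[float]:
    """Fraction of aligned samples marked improved; None if nothing aligned."""
    aligned = [d for d in directions if d in ALIGNED_DIRECTIONS]
    if not aligned:
        return None
    wins = sum(1 for d in aligned if d == "improved")
    return wins / len(aligned)


# Same shape as the Quick Start report: 2 regressions, 1 improvement, 2 unchanged.
directions = ["regressed", "regressed", "improved", "unchanged", "unchanged"]
rate = win_rate_from_directions(directions)
print(rate)
```

Under these assumptions the example reproduces the "1 of 5 samples (20.0%)" line from the Quick Start output.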

class inspect_mlflow.comparison.MetricComparison(name: str, scorer: str, baseline_value: float, candidate_value: float, delta: float, relative_delta: float | None, significant: bool, p_value: float | None, ci_lower: float | None, ci_upper: float | None, effect_size: float | None = None)

Comparison of an aggregate metric between two evaluation runs.

baseline_value: float

Metric value in the baseline run.

candidate_value: float

Metric value in the candidate run.

ci_lower: float | None

Lower bound of confidence interval for the difference.

ci_upper: float | None

Upper bound of confidence interval for the difference.

delta: float

Absolute difference (candidate - baseline).

effect_size: float | None = None

Cohen’s d effect size. None if not computed.

name: str

Metric name (e.g., 'accuracy', 'mean').

p_value: float | None

P-value from significance test. None if not computed.

relative_delta: float | None

Relative change as a fraction (delta / baseline). None if baseline is zero.

scorer: str

Scorer that produced this metric.

significant: bool

Whether the difference is statistically significant.

class inspect_mlflow.comparison.SampleComparison(id: int | str, epoch: int, scorer: str, baseline_score: float | None, candidate_score: float | None, delta: float | None, direction: Literal['improved', 'regressed', 'unchanged', 'new', 'missing'])

Comparison of a single sample’s score between two runs.

baseline_score: float | None

Score value in the baseline run. None if sample missing from baseline.

candidate_score: float | None

Score value in the candidate run. None if sample missing from candidate.

delta: float | None

Score difference (candidate - baseline). None if either score is missing.

direction: Literal['improved', 'regressed', 'unchanged', 'new', 'missing']

Classification of the score change between runs.

epoch: int

Epoch number.

id: int | str

Sample ID.

scorer: str

Scorer that produced this score.