Evaluation Comparison

Compare results from two Inspect AI evaluation runs to detect score regressions, compute statistical significance, and generate comparison reports.

Quick Start

from inspect_mlflow.comparison import compare_evals

result = compare_evals("logs/baseline.eval", "logs/candidate.eval")
print(result.summary())

for r in result.regressions:
    print(f"Sample {r.id}: {r.baseline_score} -> {r.candidate_score}")

Output:

Baseline:  openai/gpt-4o-mini (math_task)
Candidate: openai/gpt-4o-mini (math_task)
Samples:   5 aligned, 0 missing, 0 new

  Metric            Baseline  Candidate             Delta        Sig.
  -------------------------------------------------------------------
  match/accuracy      0.6000     0.4000   -0.2000 (-33.3%)  p=0.048*
  Effect size (match/accuracy): Cohen's d = -0.73 (medium effect)

Regressions: 2, Improvements: 1, Unchanged: 2
Candidate won on 1 of 5 samples (20.0%)

Features

Sample alignment by (id, epoch) key with string/int ID normalization
Automatic test selection: McNemar’s test for binary scores (0/1), bootstrap confidence interval for continuous scores
Effect size: Cohen’s d, a measure of practical significance that does not depend on sample size
Regression threshold: filter noise with regression_threshold=0.05
Sample filtering: sample_filter=lambda s: s.id in subset
Win rate tracking across aligned samples
No scipy dependency: all statistics implemented with NumPy only
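The alignment step in the first feature can be illustrated with a minimal sketch. It is not the library's code: `normalize_id` and `align_samples` are hypothetical names, the normalization rule (convert to string and strip whitespace) is an assumption, and it uses plain dictionaries from the standard library rather than eval logs or NumPy.

```python
from typing import Dict, List, Tuple, Union

SampleId = Union[int, str]
Key = Tuple[str, int]  # (normalized id, epoch)


def normalize_id(sample_id: SampleId) -> str:
    """Assumed rule: 7 and "7" should refer to the same sample."""
    return str(sample_id).strip()


def align_samples(
    baseline: Dict[Tuple[SampleId, int], float],
    candidate: Dict[Tuple[SampleId, int], float],
) -> Tuple[Dict[Key, Tuple[float, float]], List[Key], List[Key]]:
    """Pair scores that share an (id, epoch) key after ID normalization.

    Returns the aligned score pairs, the keys only in the baseline
    ("missing"), and the keys only in the candidate ("new").
    """
    base = {(normalize_id(i), e): s for (i, e), s in baseline.items()}
    cand = {(normalize_id(i), e): s for (i, e), s in candidate.items()}
    aligned = {k: (base[k], cand[k]) for k in base if k in cand}
    missing = sorted(k for k in base if k not in cand)
    new = sorted(k for k in cand if k not in base)
    return aligned, missing, new


# Integer and string IDs are mixed on purpose.
baseline_scores = {(1, 1): 1.0, (2, 1): 0.0, ("3", 1): 1.0}
candidate_scores = {("1", 1): 1.0, (2, 1): 1.0, (4, 1): 0.0}
aligned, missing, new = align_samples(baseline_scores, candidate_scores)
print(len(aligned), "aligned,", len(missing), "missing,", len(new), "new")
```

Keying on the normalized ID means a log that stores IDs as integers still lines up with one that stores them as strings.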

Parameters

Parameter	Default	Description
`baseline`	(required)	Path to baseline eval log or `EvalLog` object
`candidate`	(required)	Path to candidate eval log or `EvalLog` object
`scorers`	`None`	Scorer names to compare. `None` compares all common scorers
`significance`	`0.05`	P-value threshold for significance tests
`regression_threshold`	`0.0`	Minimum absolute delta to count as regression or improvement
`sample_filter`	`None`	Function to filter samples before comparison

Statistical Tests

The comparison module selects the significance test according to whether the per-sample scores are binary or continuous:

Binary scores (all values are 0.0 or 1.0): McNemar’s test with continuity correction. Tests whether discordant pairs (one run correct, other incorrect) are asymmetrically distributed.
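As an illustration of the binary case, the textbook form of McNemar's test with continuity correction can be written with the standard library alone. This is a sketch under stated assumptions rather than the module's implementation: `is_binary` and `mcnemar_test` are hypothetical names, and returning p = 1.0 when there are no discordant pairs is an assumed convention.

```python
import math
from typing import Sequence, Tuple


def is_binary(scores: Sequence[float]) -> bool:
    """True when every score is exactly 0.0 or 1.0."""
    return all(s in (0.0, 1.0) for s in scores)


def mcnemar_test(
    baseline: Sequence[float], candidate: Sequence[float]
) -> Tuple[float, float]:
    """Return (chi-square statistic, two-sided p-value) for paired 0/1 scores."""
    if len(baseline) != len(candidate):
        raise ValueError("scores must be paired")
    # Discordant pairs: exactly one of the two runs got the sample right.
    b = sum(1 for x, y in zip(baseline, candidate) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(baseline, candidate) if x == 0 and y == 1)
    if b + c == 0:
        return 0.0, 1.0  # assumed convention: no discordant pairs, no evidence
    statistic = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# 10 samples regress, 2 improve, 5 are correct in both runs.
base = [1.0] * 10 + [0.0] * 2 + [1.0] * 5
cand = [0.0] * 10 + [1.0] * 2 + [1.0] * 5
stat, p = mcnemar_test(base, cand)
print(f"chi2={stat:.3f}, p={p:.4f}")
```

Only the discordant pairs enter the statistic; samples that both runs answer the same way do not affect the result.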

Continuous scores: Shifted bootstrap confidence interval with 10,000 resamples. Computes a two-sided p-value under the null hypothesis of no difference.
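One common way to obtain a p-value from a shifted bootstrap is sketched below, assuming the shift is applied to the paired differences so that their mean is zero under the null hypothesis. The function name, the fixed seed, and the add-one smoothing of the p-value are illustrative assumptions, not details taken from the module.

```python
import random
from statistics import fmean
from typing import Sequence, Tuple


def shifted_bootstrap_pvalue(
    baseline: Sequence[float],
    candidate: Sequence[float],
    n_resamples: int = 10_000,
    seed: int = 0,
) -> Tuple[float, float]:
    """Return (observed mean difference, two-sided bootstrap p-value)."""
    if len(baseline) != len(candidate) or len(baseline) < 2:
        raise ValueError("need at least two paired scores")
    diffs = [c - b for b, c in zip(baseline, candidate)]
    observed = fmean(diffs)
    # Shift the differences so the null hypothesis (mean difference 0) holds.
    shifted = [d - observed for d in diffs]
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_resamples):
        resample = rng.choices(shifted, k=len(shifted))
        # Count resampled means at least as far from 0 as the observed one.
        if abs(fmean(resample)) >= abs(observed):
            extreme += 1
    # Add-one smoothing keeps the p-value strictly above zero.
    return observed, (extreme + 1) / (n_resamples + 1)


base = [0.50, 0.60, 0.55, 0.62, 0.58, 0.61, 0.57, 0.59]
cand = [0.70, 0.78, 0.77, 0.81, 0.79, 0.81, 0.80, 0.76]
delta, p = shifted_bootstrap_pvalue(base, cand)
print(f"mean delta={delta:.3f}, p={p:.4f}")
```

A confidence interval for the difference can be taken from percentiles of the resampled means of the unshifted differences; that part is left out to keep the sketch short.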

Effect size: Cohen’s d is computed for primary metrics whenever at least two samples are available. Values around 0.2 are small, 0.5 medium, and 0.8 large.
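The effect-size calculation can be sketched as follows. The text does not say which variant of Cohen's d the module uses, so this example assumes the pooled-standard-deviation form; `cohens_d_sketch` and `effect_label` are hypothetical names, and the exact cutoffs between labels, the "negligible" label, and the None result for zero variance are assumptions.

```python
import math
from statistics import fmean, variance
from typing import Optional, Sequence


def cohens_d_sketch(
    baseline_scores: Sequence[float], candidate_scores: Sequence[float]
) -> Optional[float]:
    """Mean difference divided by the pooled sample standard deviation."""
    if len(baseline_scores) < 2 or len(candidate_scores) < 2:
        return None  # matches the documented "fewer than 2 samples" case
    pooled_sd = math.sqrt(
        (variance(baseline_scores) + variance(candidate_scores)) / 2.0
    )
    if pooled_sd == 0.0:
        return None  # assumption: undefined when neither run varies
    return (fmean(candidate_scores) - fmean(baseline_scores)) / pooled_sd


def effect_label(d: float) -> str:
    """Map |d| onto the conventional small/medium/large bands."""
    size = abs(d)
    if size < 0.2:
        return "negligible"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"


d = cohens_d_sketch([1.0, 1.0, 1.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0, 0.0])
if d is not None:
    print(f"d = {d:.2f} ({effect_label(d)} effect)")
```

Because the mean difference is divided by a standard deviation rather than a standard error, the value does not grow just because more samples are collected.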

Result Objects

ComparisonResult provides these properties:

metrics: aggregate metric comparisons with significance results
samples: per-sample score comparisons with direction classification
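regressions: samples where the candidate scored lower than the baseline, beyond regression_threshold
improvements: samples where the candidate scored higher than the baseline, beyond regression_threshold
unchanged: samples whose score delta is within regression_threshold (identical scores by default)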
aligned_count, missing_count, new_count: alignment counts
win_rate: fraction of aligned samples where candidate won
summary(): formatted text report
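How the counts and the win rate relate to the per-sample direction labels can be shown with a short sketch. It works on plain strings rather than SampleComparison objects; `summarize_directions` is a hypothetical helper, and returning None for the win rate when no samples are aligned is an assumption about why the property is optional.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Union


def summarize_directions(
    directions: Iterable[str],
) -> Dict[str, Union[int, Optional[float]]]:
    """Derive alignment counts and win rate from direction labels."""
    counts = Counter(directions)
    # A sample is aligned when it has a score in both runs.
    aligned = counts["improved"] + counts["regressed"] + counts["unchanged"]
    win_rate = counts["improved"] / aligned if aligned else None
    return {
        "aligned_count": aligned,
        "missing_count": counts["missing"],
        "new_count": counts["new"],
        "win_rate": win_rate,
    }


labels = ["regressed", "regressed", "improved", "unchanged", "unchanged"]
print(summarize_directions(labels))
```

With two regressions, one improvement, and two unchanged samples, the candidate wins on 1 of 5 aligned samples, which matches the 20.0% reported in the Quick Start output.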

API Reference

inspect_mlflow.comparison.compare_evals

Compare results from two evaluation runs.

Loads both logs, aligns samples by (id, epoch), computes score deltas and aggregate metric differences, and runs significance tests on the differences.

Parameters:

baseline – Path to baseline eval log, or an EvalLog object.
candidate – Path to candidate eval log, or an EvalLog object.
scorers – Specific scorer names to compare. None compares all scorers common to both runs.
significance – P-value threshold for significance tests.
regression_threshold – Minimum absolute delta to count as regression or improvement. Deltas within this threshold are classified as unchanged. Default 0.0 (any difference counts).
sample_filter – Optional function to filter samples before comparison. Only samples where filter returns True are included.

Returns:

ComparisonResult with metrics, sample comparisons, and regressions.
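The effect of regression_threshold on per-sample classification can be sketched as below. `classify_direction` is a hypothetical helper, not part of the documented API, and treating a delta exactly equal to the threshold as unchanged is an assumption.

```python
from typing import Optional


def classify_direction(
    baseline_score: Optional[float],
    candidate_score: Optional[float],
    regression_threshold: float = 0.0,
) -> str:
    """Return one of: improved, regressed, unchanged, new, missing."""
    if baseline_score is None:
        return "new"  # sample exists only in the candidate run
    if candidate_score is None:
        return "missing"  # sample exists only in the baseline run
    delta = candidate_score - baseline_score
    # Deltas within the threshold are treated as noise (boundary assumed inclusive).
    if abs(delta) <= regression_threshold:
        return "unchanged"
    return "improved" if delta > 0 else "regressed"


print(classify_direction(0.80, 0.78, regression_threshold=0.05))
print(classify_direction(0.80, 0.70, regression_threshold=0.05))
print(classify_direction(0.50, 0.60))
```

With the default threshold of 0.0, any nonzero delta counts as an improvement or a regression.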

inspect_mlflow.comparison._statistics.cohens_d(baseline_scores: list[float], candidate_scores: list[float]) → float | None

Compute Cohen’s d effect size for paired samples.

Measures the practical significance of the difference between two sets of scores, independent of sample size. Values around 0.2 are small, 0.5 medium, and 0.8 large.

Parameters:

baseline_scores – Per-sample scores from baseline.
candidate_scores – Per-sample scores from candidate.

Returns:

Cohen’s d value, or None if fewer than 2 samples.

class inspect_mlflow.comparison.ComparisonResult(baseline_log: str, candidate_log: str, baseline_task: str, candidate_task: str, baseline_model: str, candidate_model: str, metrics: list[MetricComparison] = <factory>, samples: list[SampleComparison] = <factory>)

Complete comparison of two evaluation runs.

property aligned_count: int – Number of samples present in both runs.

baseline_log: str – Path to baseline log file.

baseline_model: str – Model name from baseline.

baseline_task: str – Task name from baseline.

candidate_log: str – Path to candidate log file.

candidate_model: str – Model name from candidate.

candidate_task: str – Task name from candidate.

property improvements: list[SampleComparison] – Samples where the candidate scored higher than the baseline, beyond regression_threshold.

metrics: list[MetricComparison] – Aggregate metric comparisons.

property missing_count: int – Number of samples in baseline but not in candidate.

property new_count: int – Number of samples in candidate but not in baseline.

property regressions: list[SampleComparison] – Samples where the candidate scored lower than the baseline, beyond regression_threshold.

samples: list[SampleComparison] – Per-sample score comparisons.

summary() → str – Generate a text summary of the comparison.

property unchanged: list[SampleComparison] – Samples whose score delta is within regression_threshold (identical scores by default).

property win_rate: float | None – Fraction of aligned samples where the candidate outperformed the baseline.

class inspect_mlflow.comparison.MetricComparison(name: str, scorer: str, baseline_value: float, candidate_value: float, delta: float, relative_delta: float | None, significant: bool, p_value: float | None, ci_lower: float | None, ci_upper: float | None, effect_size: float | None = None)

Comparison of an aggregate metric between two evaluation runs.

baseline_value: float – Metric value in the baseline run.

candidate_value: float – Metric value in the candidate run.

ci_lower: float | None – Lower bound of confidence interval for the difference.

ci_upper: float | None – Upper bound of confidence interval for the difference.

delta: float – Absolute difference (candidate - baseline).

effect_size: float | None = None – Cohen’s d effect size. None if not computed.

name: str – Metric name (e.g., ‘accuracy’, ‘mean’).

p_value: float | None – P-value from significance test. None if not computed.

relative_delta: float | None – Relative change as a fraction (delta / baseline). None if baseline is zero.

scorer: str – Scorer that produced this metric.

significant: bool – Whether the difference is statistically significant.
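The delta and relative_delta fields follow directly from the two metric values. A minimal sketch, using a hypothetical helper name and the Quick Start numbers:

```python
from typing import Optional, Tuple


def metric_delta(
    baseline_value: float, candidate_value: float
) -> Tuple[float, Optional[float]]:
    """Return (delta, relative_delta) as described for MetricComparison."""
    delta = candidate_value - baseline_value
    # The relative change is undefined when the baseline is zero.
    relative = delta / baseline_value if baseline_value != 0 else None
    return delta, relative


delta, relative = metric_delta(0.6, 0.4)
print(f"{delta:+.4f} ({relative:+.1%})")
```

For the Quick Start example this gives a delta of about -0.2 and a relative change of about -33.3%.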

class inspect_mlflow.comparison.SampleComparison(id: int | str, epoch: int, scorer: str, baseline_score: float | None, candidate_score: float | None, delta: float | None, direction: Literal['improved', 'regressed', 'unchanged', 'new', 'missing'])

Comparison of a single sample’s score between two runs.

baseline_score: float | None – Score value in the baseline run. None if sample missing from baseline.

candidate_score: float | None – Score value in the candidate run. None if sample missing from candidate.

delta: float | None – Score difference (candidate - baseline). None if either score is missing.

direction: Literal['improved', 'regressed', 'unchanged', 'new', 'missing'] – Classification of the score change between runs.

epoch: int – Epoch number.

id: int | str – Sample ID.

scorer: str – Scorer that produced this score.