Evaluation Comparison
Compare results from two Inspect AI evaluation runs to detect score regressions, compute statistical significance, and generate comparison reports.
Quick Start
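from inspect_mlflow.comparison import compare_evals

# Compare a baseline log against a candidate log
result = compare_evals("logs/baseline.eval", "logs/candidate.eval")
print(result.summary())

# List every sample where the candidate scored lower than the baseline
for r in result.regressions:
    print(f"Sample {r.id}: {r.baseline_score} -> {r.candidate_score}")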
Output:
Baseline: openai/gpt-4o-mini (math_task)
Candidate: openai/gpt-4o-mini (math_task)
Samples: 5 aligned, 0 missing, 0 new
Metric Baseline Candidate Delta Sig.
-------------------------------------------------------------------
match/accuracy 0.6000 0.4000 -0.2000 (-33.3%) p=0.048*
Effect size (match/accuracy): Cohen's d = -0.73 (medium effect)
Regressions: 2, Improvements: 1, Unchanged: 2
Candidate won on 1 of 5 samples (20.0%)
Features
- Sample alignment by (id, epoch) key with string/int ID normalization
- Automatic test selection: McNemar’s test for binary scores (0/1), bootstrap CI for continuous scores
- Effect size: Cohen’s d computed independently of sample size
- Regression threshold: filter noise with regression_threshold=0.05
- Sample filtering: sample_filter=lambda s: s.id in subset
- Win rate tracking across aligned samples
- No scipy dependency: all statistics implemented with NumPy only
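The alignment step can be pictured with a small standalone sketch. It is not the module's code: the `normalize_key` and `align` helpers are hypothetical, ID normalization is assumed to mean comparing IDs as strings, and "missing" is assumed to mean "in the baseline only" while "new" means "in the candidate only".

```python
from typing import Dict, List, Tuple

Key = Tuple[str, int]


def normalize_key(sample_id, epoch) -> Key:
    """Assumption: IDs 1 and "1" refer to the same sample, so compare as strings."""
    return (str(sample_id).strip(), int(epoch))


def align(baseline: Dict, candidate: Dict):
    """Split (id, epoch) keys into aligned, missing, and new.

    Both inputs map (id, epoch) -> score.
    """
    base = {normalize_key(i, e): s for (i, e), s in baseline.items()}
    cand = {normalize_key(i, e): s for (i, e), s in candidate.items()}
    aligned = {k: (base[k], cand[k]) for k in base if k in cand}
    missing: List[Key] = sorted(k for k in base if k not in cand)
    new: List[Key] = sorted(k for k in cand if k not in base)
    return aligned, missing, new


# Baseline uses integer IDs, candidate uses string IDs.
baseline = {(1, 1): 1.0, (2, 1): 0.0, (3, 1): 1.0}
candidate = {("1", 1): 1.0, ("2", 1): 1.0, ("4", 1): 0.0}

aligned, missing, new = align(baseline, candidate)
print(f"Samples: {len(aligned)} aligned, {len(missing)} missing, {len(new)} new")
# Samples: 2 aligned, 1 missing, 1 new
```

Without the normalization step, none of these samples would align, because `1 != "1"` in Python.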
Parameters
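| Parameter | Default | Description |
|---|---|---|
| baseline | (required) | Path to baseline eval log or EvalLog object |
| candidate | (required) | Path to candidate eval log or EvalLog object |
| scorers | None | Scorer names to compare. None compares all. |
| significance | 0.05 | P-value threshold for significance tests |
| regression_threshold | 0.0 | Minimum absolute delta to count as regression or improvement |
| sample_filter | None | Function to filter samples before comparison |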
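To make the effect of regression_threshold concrete, here is a minimal sketch of how a per-sample direction could be derived from two scores. The `classify` function is hypothetical, and treating a delta exactly equal to the threshold as unchanged is an assumption. The direction labels themselves are the ones listed for SampleComparison.

```python
from typing import Optional


def classify(
    baseline_score: Optional[float],
    candidate_score: Optional[float],
    regression_threshold: float = 0.0,
) -> str:
    """Hypothetical direction classifier for one sample."""
    if baseline_score is None:
        return "new"  # assumption: sample only exists in the candidate run
    if candidate_score is None:
        return "missing"  # assumption: sample only exists in the baseline run
    delta = candidate_score - baseline_score
    # Deltas within the threshold are treated as noise.
    if abs(delta) <= regression_threshold:
        return "unchanged"
    return "improved" if delta > 0 else "regressed"


pairs = [(0.60, 0.62), (0.60, 0.50), (0.60, 0.75), (None, 0.90), (0.40, None)]

print([classify(b, c, regression_threshold=0.05) for b, c in pairs])
# ['unchanged', 'regressed', 'improved', 'new', 'missing']

print([classify(b, c) for b, c in pairs])
# ['improved', 'regressed', 'improved', 'new', 'missing']
```

With the default of 0.0, the small 0.60 to 0.62 change counts as an improvement. With a threshold of 0.05, it is classified as unchanged.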
Statistical Tests
The comparison module selects the appropriate test based on score distribution:
Binary scores (all values are 0.0 or 1.0): McNemar’s test with continuity correction. Tests whether discordant pairs (one run correct, other incorrect) are asymmetrically distributed.
Continuous scores: Shifted bootstrap confidence interval with 10,000 resamples. Computes a two-sided p-value under the null hypothesis of no difference.
Effect size: Cohen’s d is always computed for primary metrics. Values around 0.2 are small, 0.5 medium, and 0.8 large.
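The test selection described above can be sketched in a few lines. This is an illustrative, standard-library version rather than the module's NumPy implementation. All function names are hypothetical, and the bootstrap variant shown (shifting the paired differences to mean zero, then resampling) is one common reading of "shifted bootstrap", assumed here. Inputs are assumed to be non-empty and paired.

```python
import math
import random


def is_binary(scores):
    """True when every score is exactly 0 or 1."""
    return all(s in (0.0, 1.0) for s in scores)


def mcnemar_p(baseline, candidate):
    """McNemar's test with continuity correction on paired 0/1 scores."""
    # Discordant pairs: b = baseline right, candidate wrong; c = the reverse.
    b = sum(1 for x, y in zip(baseline, candidate) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(baseline, candidate) if x == 0 and y == 1)
    if b + c == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    chi2 = max(abs(b - c) - 1, 0) ** 2 / (b + c)
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


def shifted_bootstrap_p(baseline, candidate, resamples=10_000, seed=0):
    """Two-sided p-value for the mean paired difference under H0: no difference."""
    rng = random.Random(seed)
    diffs = [y - x for x, y in zip(baseline, candidate)]
    n = len(diffs)
    observed = sum(diffs) / n
    # Shift the differences so their mean is zero, which enforces the null.
    shifted = [d - observed for d in diffs]
    extreme = 0
    for _ in range(resamples):
        mean = sum(rng.choice(shifted) for _ in range(n)) / n
        if abs(mean) >= abs(observed):
            extreme += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (extreme + 1) / (resamples + 1)


def paired_p_value(baseline, candidate):
    """Pick the test from the score distribution, as described above."""
    if is_binary(baseline) and is_binary(candidate):
        return mcnemar_p(baseline, candidate)
    return shifted_bootstrap_p(baseline, candidate)


# 8 samples flip from correct to incorrect, 1 flips the other way, 3 agree.
binary_base = [1.0] * 8 + [0.0] + [1.0] * 3
binary_cand = [0.0] * 8 + [1.0] + [1.0] * 3
print(round(paired_p_value(binary_base, binary_cand), 4))  # 0.0455
```

In the binary example the statistic is (|8 - 1| - 1)^2 / 9 = 4.0, which gives p of about 0.0455 on one degree of freedom. Only the discordant pairs contribute; the three samples where both runs agree do not affect the result.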
Result Objects
ComparisonResult provides these properties and methods:
- metrics: aggregate metric comparisons with significance results
- samples: per-sample score comparisons with direction classification
- regressions: samples where candidate scored lower
- improvements: samples where candidate scored higher
- unchanged: samples with identical scores
- aligned_count, missing_count, new_count: alignment counts
- win_rate: fraction of aligned samples where candidate won
- summary(): formatted text report
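As a rough illustration of how the counts and the win rate relate, the sketch below derives them from a plain list of direction labels. It does not use ComparisonResult itself. It assumes the candidate "wins" a sample exactly when that sample is classified as improved, and that only improved, regressed, and unchanged samples count as aligned.

```python
from collections import Counter

# Direction labels for five aligned samples (stand-ins for SampleComparison.direction).
directions = ["regressed", "regressed", "improved", "unchanged", "unchanged"]

counts = Counter(directions)
aligned = counts["improved"] + counts["regressed"] + counts["unchanged"]
# Assumption: a win is a sample where the candidate improved on the baseline.
win_rate = counts["improved"] / aligned if aligned else 0.0

print(
    f"Regressions: {counts['regressed']}, "
    f"Improvements: {counts['improved']}, "
    f"Unchanged: {counts['unchanged']}"
)
print(f"Candidate won on {counts['improved']} of {aligned} samples ({win_rate:.1%})")
# Regressions: 2, Improvements: 1, Unchanged: 2
# Candidate won on 1 of 5 samples (20.0%)
```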
API Reference
- inspect_mlflow.comparison.compare_evals(baseline: str | Path | EvalLog, candidate: str | Path | EvalLog, scorers: list[str] | None = None, significance: float = 0.05, regression_threshold: float = 0.0, sample_filter: Callable[[EvalSample], bool] | None = None) -> ComparisonResult
Compare results from two evaluation runs.
Loads both logs, aligns samples by (id, epoch), computes score deltas and aggregate metric differences, and runs significance tests on the differences.
- Parameters:
baseline – Path to baseline eval log, or an EvalLog object.
candidate – Path to candidate eval log, or an EvalLog object.
scorers – Specific scorer names to compare. None compares all.
significance – P-value threshold for significance tests.
regression_threshold – Minimum absolute delta to count as regression or improvement. Deltas within this threshold are classified as unchanged. Default 0.0 (any difference counts).
sample_filter – Optional function to filter samples before comparison. Only samples where filter returns True are included.
- Returns:
ComparisonResult with metrics, sample comparisons, and regressions.
- inspect_mlflow.comparison._statistics.cohens_d(baseline_scores: list[float], candidate_scores: list[float]) -> float | None
Compute Cohen’s d effect size for paired samples.
Measures the practical significance of the difference between two sets of scores, independent of sample size. Values around 0.2 are small, 0.5 medium, and 0.8 large.
- Parameters:
baseline_scores – Per-sample scores from baseline.
candidate_scores – Per-sample scores from candidate.
- Returns:
Cohen’s d value, or None if fewer than 2 samples.
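For intuition, here is a minimal sketch of one common Cohen's d formulation: the difference in means divided by the pooled standard deviation of the two runs. The exact denominator used by cohens_d is not stated here, so the formula, the `cohens_d_sketch` and `effect_label` names, and the label cut-offs are all assumptions for illustration.

```python
import math
from typing import Optional, Sequence


def cohens_d_sketch(
    baseline_scores: Sequence[float], candidate_scores: Sequence[float]
) -> Optional[float]:
    """Mean difference over pooled standard deviation (equal-sized groups)."""
    n_b, n_c = len(baseline_scores), len(candidate_scores)
    if n_b < 2 or n_c < 2:
        return None  # variance is undefined with fewer than 2 samples
    mean_b = sum(baseline_scores) / n_b
    mean_c = sum(candidate_scores) / n_c
    # Sample variances (n - 1 in the denominator).
    var_b = sum((x - mean_b) ** 2 for x in baseline_scores) / (n_b - 1)
    var_c = sum((x - mean_c) ** 2 for x in candidate_scores) / (n_c - 1)
    pooled_sd = math.sqrt((var_b + var_c) / 2.0)
    if pooled_sd == 0.0:
        return None  # sketch choice: undefined when neither run varies
    return (mean_c - mean_b) / pooled_sd


def effect_label(d: float) -> str:
    """Map |d| to a label using the conventional 0.2 / 0.5 / 0.8 cut-offs."""
    size = abs(d)
    if size < 0.2:
        return "negligible"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"


d = cohens_d_sketch([1, 1, 1, 0, 0], [1, 0, 0, 1, 0])
print(round(d, 2), effect_label(d))  # -0.37 small
```

A negative value means the candidate scored lower on average. Because d is scaled by the spread of the scores rather than by the number of samples, it does not grow just because more samples were evaluated.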
- class inspect_mlflow.comparison.ComparisonResult(baseline_log: str, candidate_log: str, baseline_task: str, candidate_task: str, baseline_model: str, candidate_model: str, metrics: list[MetricComparison] = <factory>, samples: list[SampleComparison] = <factory>)
Complete comparison of two evaluation runs.
- property improvements: list[SampleComparison]
Samples where the candidate scored higher than baseline.
- metrics: list[MetricComparison]
Aggregate metric comparisons.
- property regressions: list[SampleComparison]
Samples where the candidate scored lower than baseline.
- samples: list[SampleComparison]
Per-sample score comparisons.
- property unchanged: list[SampleComparison]
Samples with identical scores in both runs.
- class inspect_mlflow.comparison.MetricComparison(name: str, scorer: str, baseline_value: float, candidate_value: float, delta: float, relative_delta: float | None, significant: bool, p_value: float | None, ci_lower: float | None, ci_upper: float | None, effect_size: float | None = None)
Comparison of an aggregate metric between two evaluation runs.
- class inspect_mlflow.comparison.SampleComparison(id: int | str, epoch: int, scorer: str, baseline_score: float | None, candidate_score: float | None, delta: float | None, direction: Literal['improved', 'regressed', 'unchanged', 'new', 'missing'])
Comparison of a single sample’s score between two runs.
- baseline_score: float | None
Score value in the baseline run. None if sample missing from baseline.
- candidate_score: float | None
Score value in the candidate run. None if sample missing from candidate.