Evaluation Comparison
=====================

Compare results from two Inspect AI evaluation runs to detect score regressions,
compute statistical significance, and generate comparison reports.

Quick Start
-----------

.. code-block:: python

   from inspect_mlflow.comparison import compare_evals

   result = compare_evals("logs/baseline.eval", "logs/candidate.eval")
   print(result.summary())

   for r in result.regressions:
       print(f"Sample {r.id}: {r.baseline_score} -> {r.candidate_score}")

Output:

.. code-block:: text

   Baseline:  openai/gpt-4o-mini (math_task)
   Candidate: openai/gpt-4o-mini (math_task)
   Samples:   5 aligned, 0 missing, 0 new

   Metric               Baseline   Candidate  Delta              Sig.
   -------------------------------------------------------------------
   match/accuracy       0.6000     0.4000     -0.2000 (-33.3%)   p=0.048*

   Effect size (match/accuracy): Cohen's d = -0.73 (medium effect)

   Regressions: 2, Improvements: 1, Unchanged: 2
   Candidate won on 1 of 5 samples (20.0%)

Features
--------

- **Sample alignment** by ``(id, epoch)`` key with string/int ID normalization
- **Automatic test selection**: McNemar's test for binary scores (0/1),
  bootstrap CI for continuous
- **Effect size**: Cohen's d computed independently of sample size
- **Regression threshold**: filter noise with ``regression_threshold=0.05``
- **Sample filtering**: ``sample_filter=lambda s: s.id in subset``
- **Win rate** tracking across aligned samples
- **No scipy dependency**: all statistics implemented with NumPy only

Parameters
----------

.. list-table::
   :header-rows: 1

   * - Parameter
     - Default
     - Description
   * - ``baseline``
     - (required)
     - Path to baseline eval log or ``EvalLog`` object
   * - ``candidate``
     - (required)
     - Path to candidate eval log or ``EvalLog`` object
   * - ``scorers``
     - ``None``
     - Scorer names to compare. ``None`` compares all common scorers
   * - ``significance``
     - ``0.05``
     - P-value threshold for significance tests
   * - ``regression_threshold``
     - ``0.0``
     - Minimum delta to count as regression or improvement
   * - ``sample_filter``
     - ``None``
     - Function to filter samples before comparison

Statistical Tests
-----------------

The comparison module selects the appropriate test based on score distribution:

**Binary scores** (all values are 0.0 or 1.0): McNemar's test with continuity
correction. Tests whether discordant pairs (one run correct, other incorrect)
are asymmetrically distributed.

**Continuous scores**: Shifted bootstrap confidence interval with 10,000
resamples. Computes a two-sided p-value under the null hypothesis of no
difference.

**Effect size**: Cohen's d is always computed for primary metrics. Values
around 0.2 are small, 0.5 medium, and 0.8 large.

Result Objects
--------------

``ComparisonResult`` provides these properties:

- ``metrics``: aggregate metric comparisons with significance results
- ``samples``: per-sample score comparisons with direction classification
- ``regressions``: samples where candidate scored lower
- ``improvements``: samples where candidate scored higher
- ``unchanged``: samples with identical scores
- ``aligned_count``, ``missing_count``, ``new_count``: alignment counts
- ``win_rate``: fraction of aligned samples where candidate won
- ``summary()``: formatted text report

API Reference
-------------

.. autofunction:: inspect_mlflow.comparison.compare_evals

.. autofunction:: inspect_mlflow.comparison._statistics.cohens_d

.. autoclass:: inspect_mlflow.comparison.ComparisonResult
   :members:

.. autoclass:: inspect_mlflow.comparison.MetricComparison
   :members:

.. autoclass:: inspect_mlflow.comparison.SampleComparison
   :members:
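
Example: Statistical Test Sketch
--------------------------------

The three statistics named in this page (McNemar's test with continuity
correction, a shifted bootstrap p-value, and Cohen's d) can be sketched
independently of the package. The code below is a minimal illustration and not
the module's implementation: it uses only the Python standard library rather
than NumPy, the function names ``mcnemar_p``, ``shifted_bootstrap_p``, and
``pooled_cohens_d`` are hypothetical, and the choice of a pooled standard
deviation for Cohen's d and of the ``(k + 1) / (n + 1)`` p-value correction are
assumptions. The library's own ``cohens_d`` may use a different variant.

.. code-block:: python

   import math
   import random
   import statistics


   def mcnemar_p(b: int, c: int) -> float:
       """Two-sided McNemar p-value with continuity correction.

       b: pairs where only the baseline was correct.
       c: pairs where only the candidate was correct.
       """
       n = b + c
       if n == 0:
           return 1.0  # no discordant pairs, so no evidence of a difference
       chi2 = (abs(b - c) - 1) ** 2 / n
       # Survival function of a chi-square variable with 1 degree of freedom.
       return math.erfc(math.sqrt(chi2 / 2))


   def shifted_bootstrap_p(baseline, candidate, n_resamples=10_000, seed=0):
       """Two-sided p-value for the mean paired difference.

       The differences are shifted to have mean zero (the null hypothesis),
       then resampled with replacement.
       """
       diffs = [c - b for b, c in zip(baseline, candidate)]
       observed = statistics.fmean(diffs)
       shifted = [d - observed for d in diffs]
       rng = random.Random(seed)
       extreme = 0
       for _ in range(n_resamples):
           resample = rng.choices(shifted, k=len(shifted))
           if abs(statistics.fmean(resample)) >= abs(observed):
               extreme += 1
       # Assumption: add-one correction so the p-value is never exactly zero.
       return (extreme + 1) / (n_resamples + 1)


   def pooled_cohens_d(baseline, candidate) -> float:
       """Cohen's d using the pooled sample standard deviation (assumed variant)."""
       pooled_var = (statistics.variance(baseline) + statistics.variance(candidate)) / 2
       if pooled_var == 0:
           return 0.0  # both runs are constant, so the effect size is undefined
       return (statistics.fmean(candidate) - statistics.fmean(baseline)) / math.sqrt(pooled_var)


   if __name__ == "__main__":
       base = [1.0, 1.0, 1.0, 0.0, 0.0]
       cand = [1.0, 1.0, 0.0, 0.0, 0.0]
       # Count discordant pairs for the binary case.
       b = sum(1 for x, y in zip(base, cand) if x == 1.0 and y == 0.0)
       c = sum(1 for x, y in zip(base, cand) if x == 0.0 and y == 1.0)
       print("McNemar p:", mcnemar_p(b, c))
       print("Bootstrap p:", shifted_bootstrap_p(base, cand, n_resamples=2_000))
       print("Cohen's d:", pooled_cohens_d(base, cand))

With very few discordant pairs the continuity-corrected McNemar statistic is
small, so short evaluation runs rarely reach significance. This is one reason
to read the effect size alongside the p-value.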