inspect-mlflow
MLflow integration for Inspect AI. Provides experiment tracking, execution tracing, evaluation comparison, LLM provider autolog, structured artifact tables, trace assessments, and Scout analysis for Inspect AI evaluations.
pip install inspect-mlflow
Set environment variables and run evals as usual. Hooks auto-register via entry points.
export MLFLOW_TRACKING_URI="http://localhost:5000"
export MLFLOW_INSPECT_TRACING="true"
inspect eval my_task.py --model openai/gpt-4o
Contents
- Tracking Hook
- Tracing Hook
- Evaluation Comparison
- Configuration
- Scout Import
- API Reference
API Reference
MLflow integration for Inspect AI.
Provides experiment tracking, execution tracing, evaluation comparison, and Scout analysis for Inspect AI evaluations via MLflow.
Install and use:
pip install inspect-mlflow
# Set env vars
export MLFLOW_TRACKING_URI="http://localhost:5000"
export MLFLOW_INSPECT_TRACING="true"  # optional, enables tracing
# Run evals as usual. Hooks auto-activate.
inspect eval my_task.py
- class inspect_mlflow.ComparisonResult(baseline_log: str, candidate_log: str, baseline_task: str, candidate_task: str, baseline_model: str, candidate_model: str, metrics: list[MetricComparison] = <factory>, samples: list[SampleComparison] = <factory>)
Complete comparison of two evaluation runs.
- property improvements: list[SampleComparison]
Samples where the candidate scored higher than baseline.
- metrics: list[MetricComparison]
Aggregate metric comparisons.
- property regressions: list[SampleComparison]
Samples where the candidate scored lower than baseline.
- samples: list[SampleComparison]
Per-sample score comparisons.
- property unchanged: list[SampleComparison]
Samples with identical scores in both runs.
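The three properties above read like filters over samples. A minimal sketch of that idea follows, assuming each property selects on the documented direction field; SampleRow and by_direction are hypothetical stand-ins written for this illustration, not the library's own code.

```python
from dataclasses import dataclass
from typing import Literal, Optional, Union

Direction = Literal["improved", "regressed", "unchanged", "new", "missing"]


@dataclass
class SampleRow:
    """Hypothetical stand-in mirroring the documented SampleComparison fields."""

    id: Union[int, str]
    epoch: int
    scorer: str
    baseline_score: Optional[float]
    candidate_score: Optional[float]
    delta: Optional[float]
    direction: Direction


def by_direction(samples: list[SampleRow], direction: Direction) -> list[SampleRow]:
    """Return only the samples whose direction label matches (assumed behavior)."""
    return [s for s in samples if s.direction == direction]


samples = [
    SampleRow("q1", 1, "match", 0.0, 1.0, 1.0, "improved"),
    SampleRow("q2", 1, "match", 1.0, 0.0, -1.0, "regressed"),
    SampleRow("q3", 1, "match", 1.0, 1.0, 0.0, "unchanged"),
]

regressions = by_direction(samples, "regressed")
print([s.id for s in regressions])
```

With a real ComparisonResult, the equivalent lookup would presumably be result.regressions, result.improvements, or result.unchanged.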
- class inspect_mlflow.MetricComparison(name: str, scorer: str, baseline_value: float, candidate_value: float, delta: float, relative_delta: float | None, significant: bool, p_value: float | None, ci_lower: float | None, ci_upper: float | None, effect_size: float | None = None)
Comparison of an aggregate metric between two evaluation runs.
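The relationship between baseline_value, candidate_value, delta, and relative_delta is not spelled out here. The sketch below shows one plausible reading, assuming delta is candidate minus baseline and that relative_delta is None when the baseline is zero; metric_delta is a hypothetical helper, not part of the package.

```python
from typing import Optional


def metric_delta(
    baseline_value: float, candidate_value: float
) -> tuple[float, Optional[float]]:
    """Return (delta, relative_delta) under the assumptions stated above."""
    delta = candidate_value - baseline_value
    # Assumption: a zero baseline makes the relative change undefined.
    relative = delta / abs(baseline_value) if baseline_value != 0 else None
    return delta, relative


print(metric_delta(0.5, 0.75))
print(metric_delta(0.0, 0.4))
```

Under this reading, a positive delta means the candidate scored higher on the metric.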
- class inspect_mlflow.MlflowTracingHooks
MLflow Tracing Hooks.
Creates MLflow trace spans from Inspect AI evaluation events. Each eval run produces a trace with hierarchical spans for tasks, samples, model calls, tool calls, and scoring.
- enabled() -> bool
Check if the hook should be enabled.
Default implementation returns True.
Hooks may wish to override this to e.g. check the presence of an environment variable or a configuration setting.
Will be called frequently, so consider caching the result if the computation is expensive.
- async on_run_start(data: RunStart) -> None
On run start.
A “run” is a single invocation of eval() or eval_retry() which may contain many Tasks, each with many Samples and many epochs. Note that eval_retry() can be invoked multiple times within an eval_set().
- Parameters:
data – Run start data.
- async on_sample_end(data: SampleEnd) -> None
On sample end.
Called when a sample has either completed successfully, or when a sample has errored and has no retries remaining.
If a sample is run for multiple epochs, this will be called once per epoch.
- Parameters:
data – Sample end data.
- async on_sample_event(data: SampleEvent) -> None
On sample event.
Called when a sample event is emitted. Pending events are not logged here (i.e. ToolEvent and ModelEvent are not logged until they are complete).
- Parameters:
data – Sample event.
- async on_sample_start(data: SampleStart) -> None
On sample start.
Called when a sample is about to start. If the sample errors and retries, this will not be called again.
If a sample is run for multiple epochs, this will be called once per epoch.
- Parameters:
data – Sample start data.
- property settings: MLflowSettings
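The enabled() notes above suggest checking an environment variable and caching the result because the method is called frequently. A minimal sketch of that pattern follows, assuming the MLFLOW_INSPECT_TRACING variable from the quick start and a hypothetical set of truthy strings; EnvGatedHooks is a stand-in class, and a real hook would subclass Inspect AI's hooks base class instead.

```python
import os
from functools import lru_cache

# Assumption: which strings count as "on" is chosen for illustration only.
TRUTHY = {"1", "true", "yes", "on"}


@lru_cache(maxsize=1)
def tracing_enabled() -> bool:
    """Read MLFLOW_INSPECT_TRACING once and cache the answer."""
    value = os.environ.get("MLFLOW_INSPECT_TRACING", "")
    return value.strip().lower() in TRUTHY


class EnvGatedHooks:
    """Hypothetical hook class showing only the enabled() override."""

    def enabled(self) -> bool:
        # Cheap on repeated calls because the environment lookup is cached.
        return tracing_enabled()
```

The trade-off of caching is that later changes to the variable within the same process are ignored until tracing_enabled.cache_clear() is called.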
- class inspect_mlflow.MlflowTrackingHooks
Tracks Inspect AI evaluations in MLflow with hierarchical runs.
Uses MlflowClient API for isolation from user mlflow state.
- property artifact_manager: ArtifactManager
- property client: MlflowClient
- enabled() -> bool
Check if the hook should be enabled.
Default implementation returns True.
Hooks may wish to override this to e.g. check the presence of an environment variable or a configuration setting.
Will be called frequently, so consider caching the result if the computation is expensive.
- async on_model_usage(data: ModelUsageData) -> None
Called when a call to a model’s generate() method completes successfully without hitting Inspect’s local cache.
Note that this is not called when Inspect’s local cache is used and is a cache hit (i.e. if no external API call was made). Provider-side caching will result in this being called.
- Parameters:
data – Model usage data.
- async on_run_start(data: RunStart) -> None
On run start.
A “run” is a single invocation of eval() or eval_retry() which may contain many Tasks, each with many Samples and many epochs. Note that eval_retry() can be invoked multiple times within an eval_set().
- Parameters:
data – Run start data.
- async on_sample_end(data: SampleEnd) -> None
On sample end.
Called when a sample has either completed successfully, or when a sample has errored and has no retries remaining.
If a sample is run for multiple epochs, this will be called once per epoch.
- Parameters:
data – Sample end data.
- async on_sample_event(data: SampleEvent) -> None
On sample event.
Called when a sample event is emitted. Pending events are not logged here (i.e. ToolEvent and ModelEvent are not logged until they are complete).
- Parameters:
data – Sample event.
- property settings: MLflowSettings
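Because on_model_usage fires once per non-cached generate() call, a hook that wants per-model totals has to accumulate them itself. The sketch below illustrates one way to do that; Usage and UsageTotals are hypothetical names, the token field names are assumptions, and nothing here is claimed to match how MlflowTrackingHooks stores usage internally.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Usage:
    """Hypothetical stand-in for the token counts of one model call."""

    input_tokens: int
    output_tokens: int


class UsageTotals:
    """Accumulates token counts per model name across many calls."""

    def __init__(self) -> None:
        self._totals: dict[str, Usage] = defaultdict(lambda: Usage(0, 0))

    def add(self, model_name: str, usage: Usage) -> None:
        total = self._totals[model_name]
        total.input_tokens += usage.input_tokens
        total.output_tokens += usage.output_tokens

    def as_dict(self) -> dict[str, dict[str, int]]:
        """Return plain dicts, convenient for logging as metrics or JSON."""
        return {
            name: {"input_tokens": u.input_tokens, "output_tokens": u.output_tokens}
            for name, u in self._totals.items()
        }


totals = UsageTotals()
totals.add("openai/gpt-4o", Usage(100, 20))
totals.add("openai/gpt-4o", Usage(50, 5))
print(totals.as_dict())
```

Calls served from Inspect's local cache would not appear in such totals, since the hook is not invoked for them.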
- class inspect_mlflow.SampleComparison(id: int | str, epoch: int, scorer: str, baseline_score: float | None, candidate_score: float | None, delta: float | None, direction: Literal['improved', 'regressed', 'unchanged', 'new', 'missing'])
Comparison of a single sample’s score between two runs.
- baseline_score: float | None
Score value in the baseline run. None if sample missing from baseline.
- candidate_score: float | None
Score value in the candidate run. None if sample missing from candidate.
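The five direction labels can be derived from the two optional scores. The sketch below is one plausible mapping, assuming 'new' means the sample is absent from the baseline, 'missing' means it is absent from the candidate, and a delta within the threshold counts as unchanged; classify is a hypothetical helper and may differ from the library's actual rules.

```python
from typing import Optional


def classify(
    baseline: Optional[float],
    candidate: Optional[float],
    threshold: float = 0.0,
) -> str:
    """Map a pair of optional scores to a direction label (assumed semantics)."""
    if baseline is None:
        return "new"  # assumption: sample only exists in the candidate run
    if candidate is None:
        return "missing"  # assumption: sample only exists in the baseline run
    delta = candidate - baseline
    if abs(delta) <= threshold:
        return "unchanged"
    return "improved" if delta > 0 else "regressed"


print(classify(0.5, 0.75))
print(classify(None, 1.0))
```

With the default threshold of 0.0, only exactly equal scores are labeled unchanged.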
- inspect_mlflow.compare_evals(baseline: str | Path | EvalLog, candidate: str | Path | EvalLog, scorers: list[str] | None = None, significance: float = 0.05, regression_threshold: float = 0.0, sample_filter: Callable[[EvalSample], bool] | None = None) -> ComparisonResult
Compare results from two evaluation runs.
Loads both logs, aligns samples by (id, epoch), computes score deltas and aggregate metric differences, and runs significance tests on the differences.
- Parameters:
baseline – Path to baseline eval log, or an EvalLog object.
candidate – Path to candidate eval log, or an EvalLog object.
scorers – Specific scorer names to compare. None compares all.
significance – P-value threshold for significance tests.
regression_threshold – Minimum absolute delta to count as regression or improvement. Deltas within this threshold are classified as unchanged. Default 0.0 (any difference counts).
sample_filter – Optional function to filter samples before comparison. Only samples where filter returns True are included.
- Returns:
ComparisonResult with metrics, sample comparisons, and regressions.
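The alignment step described above, pairing samples by (id, epoch) and computing a delta where both sides exist, can be sketched as follows. This is an illustration under assumptions: align is a hypothetical function, scores are represented as plain dicts keyed by (id, epoch), and the real implementation works on eval logs rather than dicts.

```python
from typing import Optional, Union

Key = tuple[Union[int, str], int]  # (sample id, epoch)
Row = tuple[Key, Optional[float], Optional[float], Optional[float]]


def align(baseline: dict[Key, float], candidate: dict[Key, float]) -> list[Row]:
    """Pair scores by (id, epoch); a side that lacks the key yields None."""
    rows: list[Row] = []
    # Sort on a string form of the id so mixed int and str ids stay comparable.
    keys = sorted(set(baseline) | set(candidate), key=lambda k: (str(k[0]), k[1]))
    for key in keys:
        b = baseline.get(key)
        c = candidate.get(key)
        delta = c - b if b is not None and c is not None else None
        rows.append((key, b, c, delta))
    return rows


baseline_scores = {("q1", 1): 1.0, ("q2", 1): 0.0}
candidate_scores = {("q2", 1): 1.0, ("q3", 1): 0.5}
for row in align(baseline_scores, candidate_scores):
    print(row)
```

Keying on the epoch as well as the id keeps repeated runs of the same sample from being compared against each other.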
- async inspect_mlflow.import_mlflow_traces(experiment_name: str | None = None, tracking_uri: str | None = None, limit: int | None = None) -> AsyncIterator[Transcript]
Import MLflow traces as Scout transcripts.
- Parameters:
experiment_name – MLflow experiment name to import from. Defaults to MLFLOW_EXPERIMENT_NAME env var or “inspect_ai”.
tracking_uri – MLflow tracking server URI. Defaults to MLFLOW_TRACKING_URI env var.
limit – Maximum number of traces to import. None for all.
- Yields:
Transcript objects ready for Scout database insertion.
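Since the function is declared async and yields its results, it is presumably consumed with async for. The sketch below shows that consumption pattern with a limit; fake_traces is a hypothetical stand-in that yields strings, used here so the example runs without an MLflow server, and it is not the library's implementation.

```python
import asyncio
from typing import AsyncIterator, Optional


async def fake_traces(limit: Optional[int] = None) -> AsyncIterator[str]:
    """Hypothetical stand-in for an async source of transcripts."""
    for index, name in enumerate(["trace-a", "trace-b", "trace-c"]):
        if limit is not None and index >= limit:
            break
        await asyncio.sleep(0)  # yield control, as a real network fetch would
        yield name


async def collect(limit: Optional[int] = None) -> list[str]:
    """Drain the async iterator into a list."""
    return [item async for item in fake_traces(limit=limit)]


result = asyncio.run(collect(limit=2))
print(result)
```

With the real function, the loop body would handle each Transcript as it arrives, which avoids holding every trace in memory at once.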