API Reference

MLflow integration for Inspect AI.

Provides experiment tracking, execution tracing, evaluation comparison, and Scout analysis for Inspect AI evaluations via MLflow.

Install and use:

pip install inspect-mlflow

# Set env vars
export MLFLOW_TRACKING_URI="http://localhost:5000"
export MLFLOW_INSPECT_TRACING="true"  # optional, enables tracing

# Run evals as usual. Hooks auto-activate.
inspect eval my_task.py

class inspect_mlflow.ComparisonResult(baseline_log: str, candidate_log: str, baseline_task: str, candidate_task: str, baseline_model: str, candidate_model: str, metrics: list[MetricComparison] = <factory>, samples: list[SampleComparison] = <factory>)

Complete comparison of two evaluation runs.

property aligned_count: int

Number of samples present in both runs.

baseline_log: str

Path to baseline log file.

baseline_model: str

Model name from baseline.

baseline_task: str

Task name from baseline.

candidate_log: str

Path to candidate log file.

candidate_model: str

Model name from candidate.

candidate_task: str

Task name from candidate.

property improvements: list[SampleComparison]

Samples where the candidate scored higher than baseline.

metrics: list[MetricComparison]

Aggregate metric comparisons.

property missing_count: int

Samples in baseline but not in candidate.

property new_count: int

Samples in candidate but not in baseline.

property regressions: list[SampleComparison]

Samples where the candidate scored lower than baseline.

samples: list[SampleComparison]

Per-sample score comparisons.

summary() str

Generate a text summary of the comparison.

property unchanged: list[SampleComparison]

Samples with identical scores in both runs.

property win_rate: float | None

Fraction of aligned samples where candidate outperformed baseline.
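As a rough illustration of how aligned_count and win_rate relate to the per-sample direction labels, here is a minimal sketch. It is not the library's implementation: the helper name win_rate_from_directions is hypothetical, and returning None when no samples are aligned is an assumption based only on the float | None return type.

```python
from typing import List, Optional

# Directions that imply the sample exists in both runs (i.e. it is "aligned").
ALIGNED_DIRECTIONS = {"improved", "regressed", "unchanged"}


def win_rate_from_directions(directions: List[str]) -> Optional[float]:
    """Fraction of aligned samples labelled 'improved' (hypothetical helper)."""
    aligned = [d for d in directions if d in ALIGNED_DIRECTIONS]
    if not aligned:
        # Assumption: with nothing to compare, the rate is undefined.
        return None
    wins = sum(1 for d in aligned if d == "improved")
    return wins / len(aligned)


labels = ["improved", "regressed", "unchanged", "improved", "new", "missing"]
rate = win_rate_from_directions(labels)  # 2 wins out of 4 aligned samples
```

Samples labelled 'new' or 'missing' are excluded from the denominator in this sketch because they have no counterpart in the other run.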

class inspect_mlflow.MetricComparison(name: str, scorer: str, baseline_value: float, candidate_value: float, delta: float, relative_delta: float | None, significant: bool, p_value: float | None, ci_lower: float | None, ci_upper: float | None, effect_size: float | None = None)

Comparison of an aggregate metric between two evaluation runs.

baseline_value: float

Metric value in the baseline run.

candidate_value: float

Metric value in the candidate run.

ci_lower: float | None

Lower bound of confidence interval for the difference.

ci_upper: float | None

Upper bound of confidence interval for the difference.

delta: float

Absolute difference (candidate - baseline).

effect_size: float | None = None

Cohen’s d effect size. None if not computed.

name: str

Metric name (e.g., ‘accuracy’, ‘mean’).

p_value: float | None

P-value from significance test. None if not computed.

relative_delta: float | None

Relative change as a fraction (delta / baseline). None if baseline is zero.

scorer: str

Scorer that produced this metric.

significant: bool

Whether the difference is statistically significant.
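The delta and relative_delta fields follow directly from the two metric values. The sketch below shows that arithmetic under the definitions given above; metric_delta is a hypothetical helper name, not part of the package.

```python
from typing import Optional, Tuple


def metric_delta(
    baseline_value: float, candidate_value: float
) -> Tuple[float, Optional[float]]:
    """Return (delta, relative_delta) for one aggregate metric."""
    delta = candidate_value - baseline_value
    # A relative change is undefined when the baseline is zero.
    relative = None if baseline_value == 0 else delta / baseline_value
    return delta, relative


# Accuracy moving from 0.80 to 0.84: delta is about 0.04, relative change about 5%.
delta, relative = metric_delta(0.80, 0.84)
```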

class inspect_mlflow.MlflowTracingHooks

MLflow Tracing Hooks.

Creates MLflow trace spans from Inspect AI evaluation events. Each eval run produces a trace with hierarchical spans for tasks, samples, model calls, tool calls, and scoring.

enabled() bool

Check if the hook should be enabled.

Default implementation returns True.

Hooks may wish to override this to e.g. check the presence of an environment variable or a configuration setting.

Will be called frequently, so consider caching the result if the computation is expensive.

async on_run_end(data: RunEnd) None

On run end.

Parameters:

data – Run end data.

async on_run_start(data: RunStart) None

On run start.

A “run” is a single invocation of eval() or eval_retry() which may contain many Tasks, each with many Samples and many epochs. Note that eval_retry() can be invoked multiple times within an eval_set().

Parameters:

data – Run start data.

async on_sample_end(data: SampleEnd) None

On sample end.

Called when a sample has either completed successfully, or when a sample has errored and has no retries remaining.

If a sample is run for multiple epochs, this will be called once per epoch.

Parameters:

data – Sample end data.

async on_sample_event(data: SampleEvent) None

On sample event.

Called when a sample event is emitted. Pending events are not logged here (i.e. ToolEvent and ModelEvent are not logged until they are complete).

Parameters:

data – Sample event.

async on_sample_start(data: SampleStart) None

On sample start.

Called when a sample is about to start. If the sample errors and retries, this will not be called again.

If a sample is run for multiple epochs, this will be called once per epoch.

Parameters:

data – Sample start data.

async on_task_end(data: TaskEnd) None

On task end.

Parameters:

data – Task end data.

async on_task_start(data: TaskStart) None

On task start.

Parameters:

data – Task start data.

property settings: MLflowSettings

class inspect_mlflow.MlflowTrackingHooks

Tracks Inspect AI evaluations in MLflow with hierarchical runs.

Uses MlflowClient API for isolation from user mlflow state.

property artifact_manager: ArtifactManager
property client: MlflowClient
enabled() bool

Check if the hook should be enabled.

Default implementation returns True.

Hooks may wish to override this to e.g. check the presence of an environment variable or a configuration setting.

Will be called frequently, so consider caching the result if the computation is expensive.

async on_model_usage(data: ModelUsageData) None

Called when a call to a model’s generate() method completes successfully without hitting Inspect’s local cache.

Note that this is not called when Inspect’s local cache is used and is a cache hit (i.e. if no external API call was made). Provider-side caching will result in this being called.

Parameters:

data – Model usage data.

async on_run_end(data: RunEnd) None

On run end.

Parameters:

data – Run end data.

async on_run_start(data: RunStart) None

On run start.

A “run” is a single invocation of eval() or eval_retry() which may contain many Tasks, each with many Samples and many epochs. Note that eval_retry() can be invoked multiple times within an eval_set().

Parameters:

data – Run start data.

async on_sample_end(data: SampleEnd) None

On sample end.

Called when a sample has either completed successfully, or when a sample has errored and has no retries remaining.

If a sample is run for multiple epochs, this will be called once per epoch.

Parameters:

data – Sample end data.

async on_sample_event(data: SampleEvent) None

On sample event.

Called when a sample event is emitted. Pending events are not logged here (i.e. ToolEvent and ModelEvent are not logged until they are complete).

Parameters:

data – Sample event.

async on_task_end(data: TaskEnd) None

On task end.

Parameters:

data – Task end data.

async on_task_start(data: TaskStart) None

On task start.

Parameters:

data – Task start data.

property settings: MLflowSettings

class inspect_mlflow.SampleComparison(id: int | str, epoch: int, scorer: str, baseline_score: float | None, candidate_score: float | None, delta: float | None, direction: Literal['improved', 'regressed', 'unchanged', 'new', 'missing'])

Comparison of a single sample’s score between two runs.

baseline_score: float | None

Score value in the baseline run. None if sample missing from baseline.

candidate_score: float | None

Score value in the candidate run. None if sample missing from candidate.

delta: float | None

Score difference (candidate - baseline). None if either score is missing.

direction: Literal['improved', 'regressed', 'unchanged', 'new', 'missing']

Classification of the score change between runs.

epoch: int

Epoch number.

id: int | str

Sample ID.

scorer: str

Scorer that produced this score.

inspect_mlflow.compare_evals(baseline: str | Path | EvalLog, candidate: str | Path | EvalLog, scorers: list[str] | None = None, significance: float = 0.05, regression_threshold: float = 0.0, sample_filter: Callable[[EvalSample], bool] | None = None) ComparisonResult

Compare results from two evaluation runs.

Loads both logs, aligns samples by (id, epoch), computes score deltas and aggregate metric differences, and runs significance tests on the differences.

Parameters:
  • baseline – Path to baseline eval log, or an EvalLog object.

  • candidate – Path to candidate eval log, or an EvalLog object.

  • scorers – Specific scorer names to compare. None compares all.

  • significance – P-value threshold for significance tests.

  • regression_threshold – Minimum absolute delta to count as regression or improvement. Deltas within this threshold are classified as unchanged. Default 0.0 (any difference counts).

  • sample_filter – Optional function to filter samples before comparison. Only samples where filter returns True are included.

Returns:

ComparisonResult with metrics, sample comparisons, and regressions.
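The alignment and classification steps described above can be sketched as follows. This is an illustration under stated assumptions, not the package's code: the names classify and align are hypothetical, scores are passed as plain dicts keyed by (id, epoch), and a delta exactly equal to the threshold is treated as unchanged.

```python
from typing import Dict, Optional, Tuple

Key = Tuple[str, int]  # (sample id, epoch)


def classify(
    baseline: Optional[float], candidate: Optional[float], threshold: float = 0.0
) -> str:
    """Label one aligned (or unaligned) sample; hypothetical helper."""
    if baseline is None:
        return "new"  # present only in the candidate run
    if candidate is None:
        return "missing"  # present only in the baseline run
    delta = candidate - baseline
    # Assumption: the threshold boundary itself counts as unchanged.
    if abs(delta) <= threshold:
        return "unchanged"
    return "improved" if delta > 0 else "regressed"


def align(
    baseline: Dict[Key, float], candidate: Dict[Key, float], threshold: float = 0.0
) -> Dict[Key, str]:
    """Join two score maps on (id, epoch) and classify every key."""
    keys = baseline.keys() | candidate.keys()
    return {k: classify(baseline.get(k), candidate.get(k), threshold) for k in keys}


base = {("q1", 1): 1.0, ("q2", 1): 0.5, ("q3", 1): 1.0}
cand = {("q1", 1): 0.0, ("q2", 1): 0.55, ("q4", 1): 1.0}
labels = align(base, cand, threshold=0.1)
```

With the default threshold of 0.0, the small change on q2 would count as an improvement; raising the threshold to 0.1 absorbs it as noise.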

async inspect_mlflow.import_mlflow_traces(experiment_name: str | None = None, tracking_uri: str | None = None, limit: int | None = None) AsyncIterator[Transcript]

Import MLflow traces as Scout transcripts.

Parameters:
  • experiment_name – MLflow experiment name to import from. Defaults to MLFLOW_EXPERIMENT_NAME env var or “inspect_ai”.

  • tracking_uri – MLflow tracking server URI. Defaults to MLFLOW_TRACKING_URI env var.

  • limit – Maximum number of traces to import. None for all.

Yields:

Transcript objects ready for Scout database insertion.

MLflow Tracking hook for Inspect AI.

Logs evaluation runs, task configurations, sample scores, and model usage to an MLflow tracking server. Creates a parent run per eval run with nested child runs per task.

Uses MlflowClient API to avoid contaminating global mlflow state, so user code that calls mlflow.start_run() independently will not conflict.

Activated automatically when MLFLOW_TRACKING_URI is set.

class inspect_mlflow.tracking.MlflowTrackingHooks

Tracks Inspect AI evaluations in MLflow with hierarchical runs.

Uses MlflowClient API for isolation from user mlflow state.

property artifact_manager: ArtifactManager
property client: MlflowClient
enabled() bool

Check if the hook should be enabled.

Default implementation returns True.

Hooks may wish to override this to e.g. check the presence of an environment variable or a configuration setting.

Will be called frequently, so consider caching the result if the computation is expensive.

async on_model_usage(data: ModelUsageData) None

Called when a call to a model’s generate() method completes successfully without hitting Inspect’s local cache.

Note that this is not called when Inspect’s local cache is used and is a cache hit (i.e. if no external API call was made). Provider-side caching will result in this being called.

Parameters:

data – Model usage data.

async on_run_end(data: RunEnd) None

On run end.

Parameters:

data – Run end data.

async on_run_start(data: RunStart) None

On run start.

A “run” is a single invocation of eval() or eval_retry() which may contain many Tasks, each with many Samples and many epochs. Note that eval_retry() can be invoked multiple times within an eval_set().

Parameters:

data – Run start data.

async on_sample_end(data: SampleEnd) None

On sample end.

Called when a sample has either completed successfully, or when a sample has errored and has no retries remaining.

If a sample is run for multiple epochs, this will be called once per epoch.

Parameters:

data – Sample end data.

async on_sample_event(data: SampleEvent) None

On sample event.

Called when a sample event is emitted. Pending events are not logged here (i.e. ToolEvent and ModelEvent are not logged until they are complete).

Parameters:

data – Sample event.

async on_task_end(data: TaskEnd) None

On task end.

Parameters:

data – Task end data.

async on_task_start(data: TaskStart) None

On task start.

Parameters:

data – Task start data.

property settings: MLflowSettings

MLflow Tracing hook for Inspect AI.

Maps evaluation execution flow to MLflow trace spans, giving users the MLflow trace UI for debugging why a particular sample scored the way it did.

Creates a span tree mirroring the eval hierarchy:

eval_run (root)
  task: math_reasoning (CHAIN)
    sample: q1 (CHAIN)
      model_call: gpt-4o (LLM) - 847 tokens, 1.2s
      tool_call: calculator (TOOL) - args: {"expr": "2+2"}, result: 4
      score: accuracy (EVALUATOR) - value: C

Activated automatically when both MLFLOW_TRACKING_URI and MLFLOW_INSPECT_TRACING="true" are set.
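The two activation rules (tracking needs MLFLOW_TRACKING_URI; tracing additionally needs MLFLOW_INSPECT_TRACING) can be expressed as a pair of checks. This is only a sketch of the documented conditions: the function names are hypothetical, and the case-insensitive comparison with "true" is an assumption rather than confirmed behavior.

```python
from typing import Mapping


def tracking_active(env: Mapping[str, str]) -> bool:
    """Tracking hook condition: a non-empty tracking URI is configured."""
    return bool(env.get("MLFLOW_TRACKING_URI"))


def tracing_active(env: Mapping[str, str]) -> bool:
    """Tracing hook condition: tracking URI set and tracing flag is 'true'."""
    flag = env.get("MLFLOW_INSPECT_TRACING", "").strip().lower()
    # Assumption: the flag is compared case-insensitively.
    return tracking_active(env) and flag == "true"


only_tracking = {"MLFLOW_TRACKING_URI": "http://localhost:5000"}
both = {**only_tracking, "MLFLOW_INSPECT_TRACING": "true"}
```

In real use the mapping would be os.environ; passing it in explicitly keeps the check easy to test.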

class inspect_mlflow.tracing.MlflowTracingHooks

MLflow Tracing Hooks.

Creates MLflow trace spans from Inspect AI evaluation events. Each eval run produces a trace with hierarchical spans for tasks, samples, model calls, tool calls, and scoring.

enabled() bool

Check if the hook should be enabled.

Default implementation returns True.

Hooks may wish to override this to e.g. check the presence of an environment variable or a configuration setting.

Will be called frequently, so consider caching the result if the computation is expensive.

async on_run_end(data: RunEnd) None

On run end.

Parameters:

data – Run end data.

async on_run_start(data: RunStart) None

On run start.

A “run” is a single invocation of eval() or eval_retry() which may contain many Tasks, each with many Samples and many epochs. Note that eval_retry() can be invoked multiple times within an eval_set().

Parameters:

data – Run start data.

async on_sample_end(data: SampleEnd) None

On sample end.

Called when a sample has either completed successfully, or when a sample has errored and has no retries remaining.

If a sample is run for multiple epochs, this will be called once per epoch.

Parameters:

data – Sample end data.

async on_sample_event(data: SampleEvent) None

On sample event.

Called when a sample event is emitted. Pending events are not logged here (i.e. ToolEvent and ModelEvent are not logged until they are complete).

Parameters:

data – Sample event.

async on_sample_start(data: SampleStart) None

On sample start.

Called when a sample is about to start. If the sample errors and retries, this will not be called again.

If a sample is run for multiple epochs, this will be called once per epoch.

Parameters:

data – Sample start data.

async on_task_end(data: TaskEnd) None

On task end.

Parameters:

data – Task end data.

async on_task_start(data: TaskStart) None

On task start.

Parameters:

data – Task start data.

property settings: MLflowSettings

Evaluation comparison and regression detection.

Compare results from two Inspect AI evaluation runs to detect score regressions, compute statistical significance, and generate reports.

Example usage:

from inspect_mlflow.comparison import compare_evals

result = compare_evals("logs/baseline.eval", "logs/candidate.eval")
print(result.summary())

for r in result.regressions:
    print(f"Sample {r.id} regressed: {r.baseline_score} -> {r.candidate_score}")

class inspect_mlflow.comparison.ComparisonResult(baseline_log: str, candidate_log: str, baseline_task: str, candidate_task: str, baseline_model: str, candidate_model: str, metrics: list[MetricComparison] = <factory>, samples: list[SampleComparison] = <factory>)

Complete comparison of two evaluation runs.

property aligned_count: int

Number of samples present in both runs.

baseline_log: str

Path to baseline log file.

baseline_model: str

Model name from baseline.

baseline_task: str

Task name from baseline.

candidate_log: str

Path to candidate log file.

candidate_model: str

Model name from candidate.

candidate_task: str

Task name from candidate.

property improvements: list[SampleComparison]

Samples where the candidate scored higher than baseline.

metrics: list[MetricComparison]

Aggregate metric comparisons.

property missing_count: int

Samples in baseline but not in candidate.

property new_count: int

Samples in candidate but not in baseline.

property regressions: list[SampleComparison]

Samples where the candidate scored lower than baseline.

samples: list[SampleComparison]

Per-sample score comparisons.

summary() str

Generate a text summary of the comparison.

property unchanged: list[SampleComparison]

Samples with identical scores in both runs.

property win_rate: float | None

Fraction of aligned samples where candidate outperformed baseline.

class inspect_mlflow.comparison.MetricComparison(name: str, scorer: str, baseline_value: float, candidate_value: float, delta: float, relative_delta: float | None, significant: bool, p_value: float | None, ci_lower: float | None, ci_upper: float | None, effect_size: float | None = None)

Comparison of an aggregate metric between two evaluation runs.

baseline_value: float

Metric value in the baseline run.

candidate_value: float

Metric value in the candidate run.

ci_lower: float | None

Lower bound of confidence interval for the difference.

ci_upper: float | None

Upper bound of confidence interval for the difference.

delta: float

Absolute difference (candidate - baseline).

effect_size: float | None = None

Cohen’s d effect size. None if not computed.

name: str

Metric name (e.g., ‘accuracy’, ‘mean’).

p_value: float | None

P-value from significance test. None if not computed.

relative_delta: float | None

Relative change as a fraction (delta / baseline). None if baseline is zero.

scorer: str

Scorer that produced this metric.

significant: bool

Whether the difference is statistically significant.

class inspect_mlflow.comparison.SampleComparison(id: int | str, epoch: int, scorer: str, baseline_score: float | None, candidate_score: float | None, delta: float | None, direction: Literal['improved', 'regressed', 'unchanged', 'new', 'missing'])

Comparison of a single sample’s score between two runs.

baseline_score: float | None

Score value in the baseline run. None if sample missing from baseline.

candidate_score: float | None

Score value in the candidate run. None if sample missing from candidate.

delta: float | None

Score difference (candidate - baseline). None if either score is missing.

direction: Literal['improved', 'regressed', 'unchanged', 'new', 'missing']

Classification of the score change between runs.

epoch: int

Epoch number.

id: int | str

Sample ID.

scorer: str

Scorer that produced this score.

inspect_mlflow.comparison.cohens_d(baseline_scores: list[float], candidate_scores: list[float]) float | None

Compute Cohen’s d effect size for paired samples.

Measures the practical significance of the difference between two sets of scores, independent of sample size. Values around 0.2 are small, 0.5 medium, and 0.8 large.

Parameters:
  • baseline_scores – Per-sample scores from baseline.

  • candidate_scores – Per-sample scores from candidate.

Returns:

Cohen’s d value, or None if fewer than 2 samples.
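One common paired form of Cohen's d divides the mean of the per-sample differences by their sample standard deviation. The sketch below uses that form with the standard library only; whether the package uses exactly this variant is an assumption, the name paired_cohens_d is hypothetical, and returning None for zero variance is an added guard not stated in the reference.

```python
from statistics import mean, stdev
from typing import List, Optional


def paired_cohens_d(
    baseline_scores: List[float], candidate_scores: List[float]
) -> Optional[float]:
    """Mean paired difference divided by the std. dev. of the differences."""
    if len(baseline_scores) != len(candidate_scores):
        raise ValueError("paired samples must have equal length")
    diffs = [c - b for b, c in zip(baseline_scores, candidate_scores)]
    if len(diffs) < 2:
        return None  # matches the documented "fewer than 2 samples" case
    spread = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if spread == 0:
        return None  # assumption: effect size is undefined without variance
    return mean(diffs) / spread


d = paired_cohens_d([0.0, 0.0, 0.0, 0.0], [1.0, 0.0, 1.0, 0.0])
```

Here the differences are [1, 0, 1, 0], so d is 0.5 divided by roughly 0.577, which falls in the "large" range quoted above.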

inspect_mlflow.comparison.compare_evals(baseline: str | Path | EvalLog, candidate: str | Path | EvalLog, scorers: list[str] | None = None, significance: float = 0.05, regression_threshold: float = 0.0, sample_filter: Callable[[EvalSample], bool] | None = None) ComparisonResult

Compare results from two evaluation runs.

Loads both logs, aligns samples by (id, epoch), computes score deltas and aggregate metric differences, and runs significance tests on the differences.

Parameters:
  • baseline – Path to baseline eval log, or an EvalLog object.

  • candidate – Path to candidate eval log, or an EvalLog object.

  • scorers – Specific scorer names to compare. None compares all.

  • significance – P-value threshold for significance tests.

  • regression_threshold – Minimum absolute delta to count as regression or improvement. Deltas within this threshold are classified as unchanged. Default 0.0 (any difference counts).

  • sample_filter – Optional function to filter samples before comparison. Only samples where filter returns True are included.

Returns:

ComparisonResult with metrics, sample comparisons, and regressions.

Core comparison logic for evaluation runs.

Loads two eval logs, aligns samples, computes score deltas, runs significance tests, and returns a structured ComparisonResult.

inspect_mlflow.comparison._compare.compare_evals(baseline: str | Path | EvalLog, candidate: str | Path | EvalLog, scorers: list[str] | None = None, significance: float = 0.05, regression_threshold: float = 0.0, sample_filter: Callable[[EvalSample], bool] | None = None) ComparisonResult

Compare results from two evaluation runs.

Loads both logs, aligns samples by (id, epoch), computes score deltas and aggregate metric differences, and runs significance tests on the differences.

Parameters:
  • baseline – Path to baseline eval log, or an EvalLog object.

  • candidate – Path to candidate eval log, or an EvalLog object.

  • scorers – Specific scorer names to compare. None compares all.

  • significance – P-value threshold for significance tests.

  • regression_threshold – Minimum absolute delta to count as regression or improvement. Deltas within this threshold are classified as unchanged. Default 0.0 (any difference counts).

  • sample_filter – Optional function to filter samples before comparison. Only samples where filter returns True are included.

Returns:

ComparisonResult with metrics, sample comparisons, and regressions.

Data types for evaluation comparison results.

class inspect_mlflow.comparison._types.ComparisonResult(baseline_log: str, candidate_log: str, baseline_task: str, candidate_task: str, baseline_model: str, candidate_model: str, metrics: list[MetricComparison] = <factory>, samples: list[SampleComparison] = <factory>)

Complete comparison of two evaluation runs.

property aligned_count: int

Number of samples present in both runs.

baseline_log: str

Path to baseline log file.

baseline_model: str

Model name from baseline.

baseline_task: str

Task name from baseline.

candidate_log: str

Path to candidate log file.

candidate_model: str

Model name from candidate.

candidate_task: str

Task name from candidate.

property improvements: list[SampleComparison]

Samples where the candidate scored higher than baseline.

metrics: list[MetricComparison]

Aggregate metric comparisons.

property missing_count: int

Samples in baseline but not in candidate.

property new_count: int

Samples in candidate but not in baseline.

property regressions: list[SampleComparison]

Samples where the candidate scored lower than baseline.

samples: list[SampleComparison]

Per-sample score comparisons.

summary() str

Generate a text summary of the comparison.

property unchanged: list[SampleComparison]

Samples with identical scores in both runs.

property win_rate: float | None

Fraction of aligned samples where candidate outperformed baseline.

class inspect_mlflow.comparison._types.MetricComparison(name: str, scorer: str, baseline_value: float, candidate_value: float, delta: float, relative_delta: float | None, significant: bool, p_value: float | None, ci_lower: float | None, ci_upper: float | None, effect_size: float | None = None)

Comparison of an aggregate metric between two evaluation runs.

baseline_value: float

Metric value in the baseline run.

candidate_value: float

Metric value in the candidate run.

ci_lower: float | None

Lower bound of confidence interval for the difference.

ci_upper: float | None

Upper bound of confidence interval for the difference.

delta: float

Absolute difference (candidate - baseline).

effect_size: float | None = None

Cohen’s d effect size. None if not computed.

name: str

Metric name (e.g., ‘accuracy’, ‘mean’).

p_value: float | None

P-value from significance test. None if not computed.

relative_delta: float | None

Relative change as a fraction (delta / baseline). None if baseline is zero.

scorer: str

Scorer that produced this metric.

significant: bool

Whether the difference is statistically significant.

class inspect_mlflow.comparison._types.SampleComparison(id: int | str, epoch: int, scorer: str, baseline_score: float | None, candidate_score: float | None, delta: float | None, direction: Literal['improved', 'regressed', 'unchanged', 'new', 'missing'])

Comparison of a single sample’s score between two runs.

baseline_score: float | None

Score value in the baseline run. None if sample missing from baseline.

candidate_score: float | None

Score value in the candidate run. None if sample missing from candidate.

delta: float | None

Score difference (candidate - baseline). None if either score is missing.

direction: Literal['improved', 'regressed', 'unchanged', 'new', 'missing']

Classification of the score change between runs.

epoch: int

Epoch number.

id: int | str

Sample ID.

scorer: str

Scorer that produced this score.

Statistical tests for evaluation comparison.

Provides bootstrap confidence intervals, McNemar’s test for paired binary outcomes, permutation tests, and effect size. Uses NumPy only (no scipy dependency).

class inspect_mlflow.comparison._statistics.SignificanceResult(significant: bool, p_value: float, ci_lower: float, ci_upper: float, method: str)

Result of a statistical significance test.

ci_lower: float

Lower bound of confidence interval for the difference.

ci_upper: float

Upper bound of confidence interval for the difference.

method: str

Name of the test used.

p_value: float

Computed p-value.

significant: bool

Whether the difference is significant at the given level.

inspect_mlflow.comparison._statistics.bootstrap_ci(baseline_scores: list[float], candidate_scores: list[float], significance: float = 0.05, n_resamples: int = 10000, seed: int | None = 42) SignificanceResult

Compute bootstrap confidence interval for the difference in means.

Uses the shifted bootstrap (centered under H0) for proper two-sided p-value computation. Vectorized in batches for large datasets.

Parameters:
  • baseline_scores – Per-sample scores from baseline run.

  • candidate_scores – Per-sample scores from candidate run.

  • significance – Significance level (default 0.05 for 95% CI).

  • n_resamples – Number of bootstrap resamples.

  • seed – Random seed for reproducibility.

Returns:

SignificanceResult with CI bounds and p-value.
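To make the "shifted bootstrap" idea concrete, the sketch below resamples paired differences, takes percentile bounds for the confidence interval, and derives a two-sided p-value from the same resamples after centering them on zero (the null hypothesis). It is a simplified, unvectorized illustration using the standard library rather than NumPy; treating the scores as paired, the function name, and the add-one p-value correction are all assumptions.

```python
import random
from typing import List, Tuple


def shifted_bootstrap(
    baseline_scores: List[float],
    candidate_scores: List[float],
    significance: float = 0.05,
    n_resamples: int = 2000,
    seed: int = 42,
) -> Tuple[float, float, float]:
    """Return (ci_lower, ci_upper, p_value) for the mean paired difference."""
    diffs = [c - b for b, c in zip(baseline_scores, candidate_scores)]
    n = len(diffs)
    if n == 0:
        raise ValueError("need at least one paired sample")
    rng = random.Random(seed)
    observed = sum(diffs) / n

    boot_means = []
    extreme = 0
    for _ in range(n_resamples):
        resample_mean = sum(diffs[rng.randrange(n)] for _ in range(n)) / n
        boot_means.append(resample_mean)
        # Shifting by the observed mean centers the distribution under H0.
        if abs(resample_mean - observed) >= abs(observed):
            extreme += 1

    boot_means.sort()
    lower = boot_means[int((significance / 2) * n_resamples)]
    upper = boot_means[min(n_resamples - 1, int((1 - significance / 2) * n_resamples))]
    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (extreme + 1) / (n_resamples + 1)
    return lower, upper, p_value


baseline = [0.0] * 20
candidate = [1.0] * 15 + [0.0] * 5
ci_lower, ci_upper, p = shifted_bootstrap(baseline, candidate)
```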

inspect_mlflow.comparison._statistics.cohens_d(baseline_scores: list[float], candidate_scores: list[float]) float | None

Compute Cohen’s d effect size for paired samples.

Measures the practical significance of the difference between two sets of scores, independent of sample size. Values around 0.2 are small, 0.5 medium, and 0.8 large.

Parameters:
  • baseline_scores – Per-sample scores from baseline.

  • candidate_scores – Per-sample scores from candidate.

Returns:

Cohen’s d value, or None if fewer than 2 samples.

inspect_mlflow.comparison._statistics.mcnemars_test(baseline_correct: list[bool], candidate_correct: list[bool], significance: float = 0.05) SignificanceResult

McNemar’s test for paired binary outcomes.

Tests whether the rate of discordant pairs (one correct, one incorrect) differs significantly between runs. Uses the chi-square approximation with continuity correction.

Parameters:
  • baseline_correct – Per-sample correctness from baseline (True/False).

  • candidate_correct – Per-sample correctness from candidate (True/False).

  • significance – Significance level.

Returns:

SignificanceResult with p-value.
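The chi-square approximation with continuity correction can be written in a few lines. In this sketch, b counts pairs where only the baseline was correct and c counts pairs where only the candidate was correct; the p-value uses the identity that a chi-square survival function with one degree of freedom equals erfc(sqrt(x / 2)). The function name and the handling of the no-discordant-pairs case are assumptions.

```python
import math
from typing import List, Tuple


def mcnemar_chi2(
    baseline_correct: List[bool], candidate_correct: List[bool]
) -> Tuple[float, float]:
    """Return (statistic, p_value) using the continuity-corrected chi-square."""
    pairs = list(zip(baseline_correct, candidate_correct))
    b = sum(1 for base, cand in pairs if base and not cand)  # baseline-only correct
    c = sum(1 for base, cand in pairs if cand and not base)  # candidate-only correct
    if b + c == 0:
        # Assumption: no discordant pairs means no evidence of a difference.
        return 0.0, 1.0
    # Clamp at zero so b == c does not produce a spurious positive statistic.
    statistic = max(abs(b - c) - 1, 0) ** 2 / (b + c)
    # Survival function of chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


base = [True] * 2 + [False] * 10 + [True] * 8
cand = [False] * 2 + [True] * 10 + [True] * 8
stat, p = mcnemar_chi2(base, cand)  # b = 2, c = 10
```

Only the 12 discordant pairs influence the result; the 8 samples both runs got right are ignored by the test.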

inspect_mlflow.comparison._statistics.permutation_test(baseline_scores: list[float], candidate_scores: list[float], significance: float = 0.05, n_iterations: int = 10000, seed: int | None = 42) SignificanceResult

Two-sided permutation test for paired samples.

Randomly swaps baseline/candidate labels and computes the mean difference under the null hypothesis of no difference. Vectorized in batches for large datasets.

Parameters:
  • baseline_scores – Per-sample scores from baseline.

  • candidate_scores – Per-sample scores from candidate.

  • significance – Significance level.

  • n_iterations – Number of permutation iterations.

  • seed – Random seed for reproducibility.

Returns:

SignificanceResult with p-value.
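For paired data, swapping the baseline/candidate labels of one pair is equivalent to flipping the sign of that pair's difference. The sketch below implements the test that way with the standard library; it is an unvectorized illustration, and the function name and the add-one p-value correction are assumptions rather than documented behavior.

```python
import random
from typing import List


def paired_permutation_p_value(
    baseline_scores: List[float],
    candidate_scores: List[float],
    n_iterations: int = 5000,
    seed: int = 42,
) -> float:
    """Two-sided p-value for the mean paired difference via sign flipping."""
    diffs = [c - b for b, c in zip(baseline_scores, candidate_scores)]
    n = len(diffs)
    if n == 0:
        raise ValueError("need at least one paired sample")
    rng = random.Random(seed)
    observed = abs(sum(diffs) / n)

    extreme = 0
    for _ in range(n_iterations):
        # Randomly swap labels within each pair (flip the sign of its difference).
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs) / n
        # Small tolerance guards against floating-point ties.
        if abs(permuted) >= observed - 1e-12:
            extreme += 1
    return (extreme + 1) / (n_iterations + 1)


p_same = paired_permutation_p_value([0.5] * 10, [0.5] * 10)
p_diff = paired_permutation_p_value([0.0] * 12, [1.0] * 12)
```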

Scout import source for MLflow traces.

Imports MLflow traces into an Inspect Scout transcript database, enabling Scout scanners to analyze any MLflow-traced LLM application.

Usage:

from inspect_mlflow.scout import import_mlflow_traces
from inspect_scout import transcripts_db

async with transcripts_db("./my-transcripts") as db:
    await db.insert(import_mlflow_traces(
        experiment_name="inspect-mlflow-demo",
        tracking_uri="http://localhost:5000",
    ))

async inspect_mlflow.scout.import_mlflow_traces(experiment_name: str | None = None, tracking_uri: str | None = None, limit: int | None = None) AsyncIterator[Transcript]

Import MLflow traces as Scout transcripts.

Parameters:
  • experiment_name – MLflow experiment name to import from. Defaults to MLFLOW_EXPERIMENT_NAME env var or “inspect_ai”.

  • tracking_uri – MLflow tracking server URI. Defaults to MLFLOW_TRACKING_URI env var.

  • limit – Maximum number of traces to import. None for all.

Yields:

Transcript objects ready for Scout database insertion.

inspect_mlflow.scout.traces_request_time(trace: Any) str | None

Extract request time from trace info.

Autolog utilities for MLflow hook integration.

inspect_mlflow._autolog.enable_autolog(models: list[str], *, find_spec: ~collections.abc.Callable[[str], ~typing.Any] = <function find_spec>, import_module: ~collections.abc.Callable[[str], ~typing.Any] = <function import_module>) bool

Enable MLflow autolog for selected model providers.

Returns True if at least one provider was enabled.

Artifact table extraction and shaping helpers for MLflow tracking.

inspect_mlflow.artifacts.tables.extract_event_rows(*, eval_id: str, task_name: str, sample_id: Any, sample: Any) list[dict[str, Any]]
inspect_mlflow.artifacts.tables.extract_inspect_table_rows(*, eval_id: str, task_name: str, log: Any) dict[str, list[dict[str, Any]]]

Build inspect table rows from eval log content.

inspect_mlflow.artifacts.tables.extract_message_rows(*, eval_id: str, task_name: str, sample_id: Any, sample: Any) list[dict[str, Any]]
inspect_mlflow.artifacts.tables.extract_model_usage_rows(*, eval_id: str, task_name: str, sample_id: Any, sample: Any) list[dict[str, Any]]
inspect_mlflow.artifacts.tables.extract_sample_score_rows(*, eval_id: str, task_name: str, sample_id: Any, scores: Any) list[dict[str, Any]]
inspect_mlflow.artifacts.tables.extract_usage_from_events(sample: Any) dict[str, dict[str, int]]
inspect_mlflow.artifacts.tables.get_sample_output_text(sample: Any) str | None
inspect_mlflow.artifacts.tables.obj_get(obj: Any, key: str) Any
inspect_mlflow.artifacts.tables.rows_to_columns(rows: list[dict[str, Any]]) dict[str, list[Any]]
inspect_mlflow.artifacts.tables.scores_to_dict(scores: Any) dict[str, Any]
inspect_mlflow.artifacts.tables.sum_usage_map(usage_map: Any) dict[str, int]
inspect_mlflow.artifacts.tables.to_json(value: Any) str | int | float | bool | None
inspect_mlflow.artifacts.tables.to_string(value: Any) str | None
inspect_mlflow.artifacts.tables.usage_to_dict(usage: Any) dict[str, int]

Artifact manager for MLflow tracking hook.

class inspect_mlflow.artifacts.manager.ArtifactManager(client: MlflowClient, logger: Logger | None = None)

Handle Inspect artifact extraction and MLflow artifact logging.

load_full_eval_log(log: Any) Any | None
log_eval_artifacts(run_id: str, log: Any) None
log_eval_json(run_id: str, log: Any) None
log_inspect_tables(run_id: str, log: Any) None
log_sample_table(run_id: str, log: Any) None

Configuration for inspect-mlflow hooks.

Uses pydantic-settings when available for typed, validated config with the INSPECT_MLFLOW_ prefix. Falls back to os.getenv() when pydantic-settings is not installed.

class inspect_mlflow.config.MLflowSettings(tracking_uri: str | None = None, experiment_name: str = 'inspect_ai', tracing_enabled: bool = False, log_artifacts: bool = True, autolog_enabled: bool = True, autolog_models: list[str] = <factory>)

Fallback settings using os.getenv() when pydantic-settings is not installed.

autolog_enabled: bool = True
autolog_models: list[str]
experiment_name: str = 'inspect_ai'
log_artifacts: bool = True
tracing_enabled: bool = False
tracking_uri: str | None = None
inspect_mlflow.config.load_settings() MLflowSettings

Shared utilities for MLflow hooks.

inspect_mlflow.util.safe_log_params(mlflow: Any, params: dict[str, Any]) None

Log params, truncating values that exceed MLflow’s 500-char limit.
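The truncation step can be pictured as a pure function applied before anything is sent to MLflow. This sketch is not the package's code: clip_params is a hypothetical name, the 500-character limit is taken from the description above, and counting the trailing "..." toward the limit is an assumption.

```python
from typing import Any, Dict

MAX_PARAM_LEN = 500  # limit quoted in the description above


def clip_params(params: Dict[str, Any], max_len: int = MAX_PARAM_LEN) -> Dict[str, str]:
    """Stringify parameter values and shorten any that exceed max_len."""
    clipped: Dict[str, str] = {}
    for key, value in params.items():
        text = str(value)
        if len(text) > max_len:
            # Assumption: the ellipsis counts toward the length limit.
            text = text[: max_len - 3] + "..."
        clipped[key] = text
    return clipped


safe = clip_params({"prompt": "x" * 600, "epochs": 5})
```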

inspect_mlflow.util.score_to_numeric(value: Any) float | None

Convert a Score value to a numeric value for MLflow metrics.

Handles Inspect AI score conventions:
  • int/float: returned as-is
  • bool: True -> 1.0, False -> 0.0
  • str: "C"/"correct" -> 1.0, "I"/"incorrect" -> 0.0, "P"/"partial" -> 0.5
  • other: None (metric skipped)
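The conversion rules above translate into a short chain of type checks. The sketch below is an approximation for illustration only: to_numeric is a hypothetical name, and the case-insensitive string matching is an assumption. Note that bool must be tested before int, because bool is a subclass of int in Python.

```python
from typing import Any, Optional

# String grades mapped to numeric values, per the conventions listed above.
_STRING_SCORES = {
    "c": 1.0, "correct": 1.0,
    "i": 0.0, "incorrect": 0.0,
    "p": 0.5, "partial": 0.5,
}


def to_numeric(value: Any) -> Optional[float]:
    """Map a score value to a float, or None if it has no numeric meaning."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return 1.0 if value else 0.0
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        # Assumption: grades are matched case-insensitively.
        return _STRING_SCORES.get(value.strip().lower())
    return None  # unsupported types are skipped


converted = [to_numeric(v) for v in ("C", "I", "P", True, 3, [1])]
```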

inspect_mlflow.util.truncate(text: Any, max_len: int = 200) str

Truncate text to max_len characters, adding ellipsis if truncated.