pytest-texts-score documentation

A pytest plugin for semantic text similarity scoring using Large Language Models (LLMs).

It enables robust assertions over meaning, not surface text, making it ideal for validating LLM outputs, RAG systems, summaries, and other generated content.

The plugin evaluates similarity by prompting an LLM to extract and answer factual questions, producing Precision (Completeness), Recall (Correctness), and F1 scores.

Metrics overview

Recall (Correctness): Measures how much information from the expected text is present in the given text.
Precision (Completeness): Measures how much information in the given text is supported by the expected text.
F1 score: Harmonic mean of precision and recall.
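
As a quick illustration of how the three numbers relate, the sketch below computes F1 from a precision and recall pair using the standard harmonic-mean formula. The helper name `f1_score` and the zero-division convention are assumptions made for this example; they are not taken from the plugin's internals.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (illustrative helper)."""
    if precision + recall == 0:
        # Assumed convention: define F1 as 0.0 when both inputs are 0.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# A text that is fully supported (precision 1.0) but covers only half of
# the expected facts (recall 0.5) scores below the arithmetic mean of 0.75.
print(round(f1_score(1.0, 0.5), 3))  # 0.667
```

Because the harmonic mean is pulled toward the smaller value, a high F1 requires both precision and recall to be high.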

Aggregated assertions

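Aggregated assertions run the evaluation several times and aggregate the resulting scores. They are recommended for CI/CD pipelines because aggregating over several runs dampens the effect of LLM nondeterminism on test outcomes.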

Supported aggregations: min, max, median, mean / average.
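
To make the aggregation names concrete, here is a minimal sketch that reduces a list of per-run scores with each of them. The scores are made up and the `AGGREGATIONS` mapping is a hypothetical stand-in built on the standard library; it may not match how the plugin computes these values internally.

```python
import statistics

# Hypothetical F1 scores from five evaluation runs of the same text pair.
run_scores = [0.82, 0.90, 0.78, 0.86, 0.84]

# One reducer per supported aggregation name; "average" is treated as
# a synonym for "mean".
AGGREGATIONS = {
    "min": min,
    "max": max,
    "median": statistics.median,
    "mean": statistics.fmean,
    "average": statistics.fmean,
}

summary = {name: reduce(run_scores) for name, reduce in AGGREGATIONS.items()}
print(summary["min"], summary["max"], summary["median"])  # 0.78 0.9 0.84
```

The min and max reducers suit one-sided bounds, while median and mean suit a target with a tolerance; the median is less sensitive to a single outlier run.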

pytest_texts_score package

Main entry point for the pytest-texts-score public API.

This module exposes the primary functions for text-based scoring and assertions within pytest. It includes functions for single-run evaluations (texts_expect_*) and multi-run, aggregated evaluations (texts_agg_*) for metrics like F1, precision, and recall.

It also provides aliases like “completeness” for precision and “correctness” for recall, which can be more intuitive in certain testing contexts.

pytest_texts_score.texts_agg_completeness_average(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None: Alias for texts_agg_precision_average.

pytest_texts_score.texts_agg_completeness_max(expected: str, given: str, upper_bound: float, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None: Alias for texts_agg_precision_max.

pytest_texts_score.texts_agg_completeness_mean(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None: Alias for texts_agg_precision_mean.

pytest_texts_score.texts_agg_completeness_median(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None: Alias for texts_agg_precision_median.

pytest_texts_score.texts_agg_completeness_min(expected: str, given: str, lower_bound: float, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None: Alias for texts_agg_precision_min.

pytest_texts_score.texts_agg_correctness_average(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None: Alias for texts_agg_recall_average.

pytest_texts_score.texts_agg_correctness_max(expected: str, given: str, upper_bound: float, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None: Alias for texts_agg_recall_max.

pytest_texts_score.texts_agg_correctness_mean(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None: Alias for texts_agg_recall_mean.

pytest_texts_score.texts_agg_correctness_median(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None: Alias for texts_agg_recall_median.

pytest_texts_score.texts_agg_correctness_min(expected: str, given: str, lower_bound: float, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None: Alias for texts_agg_recall_min.

pytest_texts_score.texts_agg_f1_average(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None: Alias for texts_agg_f1_mean.

pytest_texts_score.texts_agg_f1_max(expected: str, given: str, upper_bound: float, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None

Assert that the maximum F1 score across runs is at or below an upper bound.

Performs multiple evaluation runs, calculates the maximum F1 score across all runs, and asserts that this maximum score is less than or equal to upper_bound.

Parameters:

expected (str) – The reference text.
given (str) – The text to evaluate.
upper_bound (float) – The maximum acceptable score for the aggregated maximum.
full_runs (int) – Number of times to generate new questions. Defaults to 5.
each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.
retry_on_error (bool) – If True, retries LLM calls on failure.

pytest_texts_score.texts_agg_f1_mean(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None

Assert that the mean aggregated F1 score is close to a target value.

Performs multiple evaluation runs, calculates the mean (average) F1 score, and asserts that it falls within the range defined by target ± max_delta.

Parameters:

expected (str) – The reference text.
given (str) – The text to evaluate.
target (float) – The expected mean score.
max_delta (float) – The allowed deviation from the target. Defaults to 0.1.
full_runs (int) – Number of times to generate new questions. Defaults to 5.
each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.
retry_on_error (bool) – If True, retries LLM calls on failure.
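
The target ± max_delta check can be sketched as follows. The helper `mean_within_target` is hypothetical, the scores are made up, and treating the range boundaries as inclusive is an assumption; the plugin's actual comparison may differ.

```python
import statistics


def mean_within_target(
    scores: list[float], target: float, max_delta: float = 0.1
) -> bool:
    """Return True when the mean lies in [target - max_delta, target + max_delta]."""
    # Assumption: both ends of the range are inclusive.
    return abs(statistics.fmean(scores) - target) <= max_delta


# Hypothetical per-run F1 scores; their mean is 0.84.
scores = [0.82, 0.90, 0.78, 0.86, 0.84]
print(mean_within_target(scores, target=0.9))  # True: |0.84 - 0.9| = 0.06
print(mean_within_target(scores, target=0.7))  # False: |0.84 - 0.7| = 0.14
```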

pytest_texts_score.texts_agg_f1_median(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None

Assert that the median aggregated F1 score is close to a target value.

Performs multiple evaluation runs, calculates the median F1 score, and asserts that it falls within the range defined by target ± max_delta.

Parameters:

expected (str) – The reference text.
given (str) – The text to evaluate.
target (float) – The expected median score.
max_delta (float) – The allowed deviation from the target. Defaults to 0.1.
full_runs (int) – Number of times to generate new questions. Defaults to 5.
each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.
retry_on_error (bool) – If True, retries LLM calls on failure.

pytest_texts_score.texts_agg_f1_min(expected: str, given: str, lower_bound: float, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None

Assert that the minimum F1 score across runs is at or above a lower bound.

Performs multiple evaluation runs, calculates the minimum F1 score across all runs, and asserts that this minimum score is greater than or equal to lower_bound.

Parameters:

expected (str) – The reference text.
given (str) – The text to evaluate.
lower_bound (float) – The minimum acceptable score for the aggregated minimum.
full_runs (int) – Number of times to generate new questions. Defaults to 5.
each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.
retry_on_error (bool) – If True, retries LLM calls on failure.

pytest_texts_score.texts_agg_precision_average(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None: Alias for texts_agg_precision_mean.

pytest_texts_score.texts_agg_precision_max(expected: str, given: str, upper_bound: float, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None

Assert that the maximum precision score across runs is at or below an upper bound.

Performs multiple evaluation runs, calculates the maximum precision score across all runs, and asserts that this maximum score is less than or equal to upper_bound.

Parameters:

expected (str) – The reference text.
given (str) – The text to evaluate.
upper_bound (float) – The maximum acceptable score for the aggregated maximum.
full_runs (int) – Number of times to generate new questions. Defaults to 5.
each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.
retry_on_error (bool) – If True, retries LLM calls on failure.

pytest_texts_score.texts_agg_precision_mean(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None

Assert that the mean aggregated precision is close to a target value.

Performs multiple evaluation runs, calculates the mean (average) precision score, and asserts that it falls within the range target ± max_delta.

Parameters:

expected (str) – The reference text.
given (str) – The text to evaluate.
target (float) – The expected mean score.
max_delta (float) – The allowed deviation from the target. Defaults to 0.1.
full_runs (int) – Number of times to generate new questions. Defaults to 5.
each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.
retry_on_error (bool) – If True, retries LLM calls on failure.

pytest_texts_score.texts_agg_precision_median(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None

Assert that the median aggregated precision is close to a target value.

Performs multiple evaluation runs, calculates the median precision score, and asserts that it falls within the range defined by target ± max_delta.

Parameters:

expected (str) – The reference text.
given (str) – The text to evaluate.
target (float) – The expected median score.
max_delta (float) – The allowed deviation from the target. Defaults to 0.1.
full_runs (int) – Number of times to generate new questions. Defaults to 5.
each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.
retry_on_error (bool) – If True, retries LLM calls on failure.

pytest_texts_score.texts_agg_precision_min(expected: str, given: str, lower_bound: float, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None

Assert that the minimum precision score across runs is at or above a lower bound.

Performs multiple evaluation runs, calculates the minimum precision score across all runs, and asserts that this minimum score is greater than or equal to lower_bound.

Parameters:

expected (str) – The reference text.
given (str) – The text to evaluate.
lower_bound (float) – The minimum acceptable score for the aggregated minimum.
full_runs (int) – Number of times to generate new questions. Defaults to 5.
each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.
retry_on_error (bool) – If True, retries LLM calls on failure.

pytest_texts_score.texts_agg_recall_average(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None: Alias for texts_agg_recall_mean.

pytest_texts_score.texts_agg_recall_max(expected: str, given: str, upper_bound: float, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None

Assert that the maximum recall score across runs is at or below an upper bound.

Performs multiple evaluation runs, calculates the maximum recall score across all runs, and asserts that this maximum score is less than or equal to upper_bound.

Parameters:

expected (str) – The reference text.
given (str) – The text to evaluate.
upper_bound (float) – The maximum acceptable score for the aggregated maximum.
full_runs (int) – Number of times to generate new questions. Defaults to 5.
each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.
retry_on_error (bool) – If True, retries LLM calls on failure.

pytest_texts_score.texts_agg_recall_mean(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None

Assert that the mean aggregated recall is close to a target value.

Performs multiple evaluation runs, calculates the mean (average) recall score, and asserts that it falls within the range target ± max_delta.

Parameters:

expected (str) – The reference text.
given (str) – The text to evaluate.
target (float) – The expected mean score.
max_delta (float) – The allowed deviation from the target. Defaults to 0.1.
full_runs (int) – Number of times to generate new questions. Defaults to 5.
each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.
retry_on_error (bool) – If True, retries LLM calls on failure.

pytest_texts_score.texts_agg_recall_median(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None

Assert that the median aggregated recall is close to a target value.

Performs multiple evaluation runs, calculates the median recall score, and asserts that it falls within the range defined by target ± max_delta.

Parameters:

expected (str) – The reference text.
given (str) – The text to evaluate.
target (float) – The expected median score.
max_delta (float) – The allowed deviation from the target. Defaults to 0.1.
full_runs (int) – Number of times to generate new questions. Defaults to 5.
each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.
retry_on_error (bool) – If True, retries LLM calls on failure.

pytest_texts_score.texts_agg_recall_min(expected: str, given: str, lower_bound: float, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None

Assert that the minimum recall score across runs is at or above a lower bound.

Performs multiple evaluation runs, calculates the minimum recall score across all runs, and asserts that this minimum score is greater than or equal to lower_bound.

Parameters:

expected (str) – The reference text.
given (str) – The text to evaluate.
lower_bound (float) – The minimum acceptable score for the aggregated minimum.
full_runs (int) – Number of times to generate new questions. Defaults to 5.
each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.
retry_on_error (bool) – If True, retries LLM calls on failure.

pytest_texts_score.texts_expect_completeness_equal(expected: str, given: str, target: float = 1.0, max_delta: float = 0.2, skip_warnings: bool = False, retry_on_error: bool = True) → None: Alias for texts_expect_precision_equal.

pytest_texts_score.texts_expect_completeness_range(expected: str, given: str, min_score: float, max_score: float, skip_warnings: bool = False, retry_on_error: bool = True) → None: Alias for texts_expect_precision_range.

pytest_texts_score.texts_expect_correctness_equal(expected: str, given: str, target: float = 1.0, max_delta: float = 0.2, skip_warnings: bool = False, retry_on_error: bool = True) → None: Alias for texts_expect_recall_equal.

pytest_texts_score.texts_expect_correctness_range(expected: str, given: str, min_score: float, max_score: float, skip_warnings: bool = False, retry_on_error: bool = True) → None: Alias for texts_expect_recall_range.

pytest_texts_score.texts_expect_f1_equal(expected: str, given: str, target: float = 1.0, max_delta: float = 0.2, skip_warnings: bool = False, retry_on_error: bool = True) → None

Assert that the F1 score is close to a target value.

This is a convenience wrapper around texts_expect_f1_range(). It performs a single F1 score evaluation and asserts that the result is within target ± max_delta.

Parameters:

expected (str) – The reference text.
given (str) – The text to evaluate.
target (float) – The expected F1 score. Defaults to 1.0.
max_delta (float) – The allowed deviation from the target. Defaults to 0.2.
skip_warnings (bool) – If True, suppresses input validation warnings.
retry_on_error (bool) – If True, retries LLM calls on failure.
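
Since the equal variant is described as a wrapper around the range variant, the translation from target ± max_delta to a score range can be sketched as below. The helper `equal_to_range` is hypothetical and only mirrors the arithmetic; whether the plugin clamps the bounds to the 0 to 1 interval is not stated here, so no clamping is applied.

```python
def equal_to_range(
    target: float = 1.0, max_delta: float = 0.2
) -> tuple[float, float]:
    """Translate target ± max_delta into a (min_score, max_score) pair."""
    return target - max_delta, target + max_delta


low, high = equal_to_range()
print(round(low, 2), round(high, 2))  # 0.8 1.2
```

Because an F1 score cannot exceed 1.0, the documented defaults amount in practice to requiring a score of at least 0.8.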

pytest_texts_score.texts_expect_f1_range(expected: str, given: str, min_score: float, max_score: float, skip_warnings: bool = False, retry_on_error: bool = True) → None

Assert that the F1 score falls within a specified range.

This function performs a single evaluation of the F1 score between the expected and given texts. It then asserts that the resulting score is between min_score and max_score.

Parameters:

expected (str) – The reference text.
given (str) – The text to evaluate.
min_score (float) – The minimum acceptable F1 score.
max_score (float) – The maximum acceptable F1 score.
skip_warnings (bool) – If True, suppresses input validation warnings.
retry_on_error (bool) – If True, retries LLM calls on failure.

pytest_texts_score.texts_expect_precision_equal(expected: str, given: str, target: float = 1.0, max_delta: float = 0.2, skip_warnings: bool = False, retry_on_error: bool = True) → None

Assert that the precision score is close to a target value.

This is a convenience wrapper around texts_expect_precision_range(). It performs a single precision score evaluation and asserts that the result is within target ± max_delta.

Parameters:

expected (str) – The reference text.
given (str) – The text to evaluate.
target (float) – The expected precision score. Defaults to 1.0.
max_delta (float) – The allowed deviation from the target. Defaults to 0.2.
skip_warnings (bool) – If True, suppresses input validation warnings.
retry_on_error (bool) – If True, retries LLM calls on failure.

pytest_texts_score.texts_expect_precision_range(expected: str, given: str, min_score: float, max_score: float, skip_warnings: bool = False, retry_on_error: bool = True) → None

Assert that the precision score falls within a specified range.

This function performs a single evaluation of the precision score between the expected and given texts. It then asserts that the resulting score is between min_score and max_score.

Parameters:

expected (str) – The reference text.
given (str) – The text to evaluate.
min_score (float) – The minimum acceptable precision score.
max_score (float) – The maximum acceptable precision score.
skip_warnings (bool) – If True, suppresses input validation warnings.
retry_on_error (bool) – If True, retries LLM calls on failure.

pytest_texts_score.texts_expect_recall_equal(expected: str, given: str, target: float = 1.0, max_delta: float = 0.2, skip_warnings: bool = False, retry_on_error: bool = True) → None

Assert that the recall score is close to a target value.

This is a convenience wrapper around texts_expect_recall_range(). It performs a single recall score evaluation and asserts that the result is within target ± max_delta.

Parameters:

expected (str) – The reference text.
given (str) – The text to evaluate.
target (float) – The expected recall score. Defaults to 1.0.
max_delta (float) – The allowed deviation from the target. Defaults to 0.2.
skip_warnings (bool) – If True, suppresses input validation warnings.
retry_on_error (bool) – If True, retries LLM calls on failure.

pytest_texts_score.texts_expect_recall_range(expected: str, given: str, min_score: float, max_score: float, skip_warnings: bool = False, retry_on_error: bool = True) → None

Assert that the recall score falls within a specified range.

This function performs a single evaluation of the recall score between the expected and given texts. It then asserts that the resulting score is between min_score and max_score.

Parameters:

expected (str) – The reference text.
given (str) – The text to evaluate.
min_score (float) – The minimum acceptable recall score.
max_score (float) – The maximum acceptable recall score.
skip_warnings (bool) – If True, suppresses input validation warnings.
retry_on_error (bool) – If True, retries LLM calls on failure.
