pytest_texts_score package

Main entry point for the pytest-texts-score public API.

This module exposes the primary functions for text-based scoring and assertions within pytest. It includes functions for single-run evaluations (texts_expect_*) and multi-run, aggregated evaluations (texts_agg_*) for metrics like F1, precision, and recall.

It also provides aliases like “completeness” for precision and “correctness” for recall, which can be more intuitive in certain testing contexts.
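For orientation, F1 is conventionally the harmonic mean of precision and recall. The helper below is a minimal sketch of that standard formula; `f1_from` is a hypothetical name, and it is an assumption that the package combines its two scores in exactly this way. How the precision and recall values themselves are obtained from LLM answers is not shown.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the standard F1 definition)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both scores are 0
    return 2 * precision * recall / (precision + recall)


# Example: precision 1.0 and recall 0.5 combine to roughly 0.667.
score = f1_from(precision=1.0, recall=0.5)
```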

pytest_texts_score.texts_agg_completeness_average(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) -> None

Alias for texts_agg_precision_average.

pytest_texts_score.texts_agg_completeness_max(expected: str, given: str, upper_bound: float, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) -> None

Alias for texts_agg_precision_max.

pytest_texts_score.texts_agg_completeness_mean(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) -> None

Alias for texts_agg_precision_mean.

pytest_texts_score.texts_agg_completeness_median(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) -> None

Alias for texts_agg_precision_median.

pytest_texts_score.texts_agg_completeness_min(expected: str, given: str, lower_bound: float, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) -> None

Alias for texts_agg_precision_min.

pytest_texts_score.texts_agg_correctness_average(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) -> None

Alias for texts_agg_recall_average.

pytest_texts_score.texts_agg_correctness_max(expected: str, given: str, upper_bound: float, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) -> None

Alias for texts_agg_recall_max.

pytest_texts_score.texts_agg_correctness_mean(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) -> None

Alias for texts_agg_recall_mean.

pytest_texts_score.texts_agg_correctness_median(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) -> None

Alias for texts_agg_recall_median.

pytest_texts_score.texts_agg_correctness_min(expected: str, given: str, lower_bound: float, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) -> None

Alias for texts_agg_recall_min.

pytest_texts_score.texts_agg_f1_average(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) -> None

Alias for texts_agg_f1_mean.
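An alias here means a second name for the same assertion, so either spelling can be used in a test. The following is a minimal sketch of that pattern using a stand-in function rather than the real LLM-backed one. The names `agg_mean_check` and `agg_average_check` are hypothetical, and the package may implement its aliases differently, for example as thin wrappers.

```python
def agg_mean_check(scores: list[float], target: float, max_delta: float = 0.1) -> None:
    """Stand-in for a *_mean assertion: checks the mean of precomputed scores."""
    mean = sum(scores) / len(scores)
    assert abs(mean - target) <= max_delta, (
        f"mean {mean:.3f} is not within {target} ± {max_delta}"
    )


# Binding a second name to the same function object creates the alias.
agg_average_check = agg_mean_check

# Behaves exactly like agg_mean_check; the mean here is about 0.9, so it passes.
agg_average_check([0.8, 0.9, 1.0], target=0.9)
```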

pytest_texts_score.texts_agg_f1_max(expected: str, given: str, upper_bound: float, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) -> None

Assert that the maximum aggregated F1 score does not exceed an upper bound.

Performs multiple evaluation runs, calculates the maximum F1 score across all runs, and asserts that this maximum score is less than or equal to upper_bound.

Parameters:
  • expected (str) – The reference text.

  • given (str) – The text to evaluate.

  • upper_bound (float) – The maximum acceptable score for the aggregated maximum.

  • full_runs (int) – Number of times to generate new questions. Defaults to 5.

  • each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.

  • retry_on_error (bool) – If True, retries LLM calls on failure.
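The maximum check is useful when two texts should never look similar, for instance a reference and an unrelated answer. Below is a minimal sketch of the comparison only, applied to precomputed scores; `assert_max_at_most` is a hypothetical helper and the scores are made-up illustration values, not output of the package.

```python
def assert_max_at_most(scores: list[float], upper_bound: float) -> None:
    """Fail if the highest score across runs exceeds upper_bound."""
    highest = max(scores)
    assert highest <= upper_bound, f"max score {highest} exceeds {upper_bound}"


# Hypothetical F1 scores from five runs comparing two unrelated texts.
run_scores = [0.05, 0.10, 0.00, 0.20, 0.15]
assert_max_at_most(run_scores, upper_bound=0.3)  # passes: the maximum is 0.20
```

Because the bound applies to the single highest run, one unusually high score is enough to fail the assertion.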

pytest_texts_score.texts_agg_f1_mean(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) -> None

Assert that the mean aggregated F1 score is close to a target value.

Performs multiple evaluation runs, calculates the mean (average) F1 score, and asserts that it falls within the range defined by target ± max_delta.

Parameters:
  • expected (str) – The reference text.

  • given (str) – The text to evaluate.

  • target (float) – The expected mean score.

  • max_delta (float) – The allowed deviation from the target. Defaults to 0.1.

  • full_runs (int) – Number of times to generate new questions. Defaults to 5.

  • each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.

  • retry_on_error (bool) – If True, retries LLM calls on failure.
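The target ± max_delta condition can be sketched on precomputed scores as follows. `assert_mean_close` is a hypothetical helper, the scores are illustrative, and treating the boundaries as inclusive is an assumption.

```python
from statistics import mean


def assert_mean_close(scores: list[float], target: float, max_delta: float = 0.1) -> None:
    """Fail unless the mean of scores lies within target ± max_delta."""
    avg = mean(scores)
    assert abs(avg - target) <= max_delta, (
        f"mean {avg:.3f} is not within {target} ± {max_delta}"
    )


# Hypothetical F1 scores from five runs; their mean is 0.846.
run_scores = [0.82, 0.90, 0.78, 0.88, 0.85]
assert_mean_close(run_scores, target=0.85)  # passes with the default max_delta
```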

pytest_texts_score.texts_agg_f1_median(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) -> None

Assert that the median aggregated F1 score is close to a target value.

Performs multiple evaluation runs, calculates the median F1 score, and asserts that it falls within the range defined by target ± max_delta.

Parameters:
  • expected (str) – The reference text.

  • given (str) – The text to evaluate.

  • target (float) – The expected median score.

  • max_delta (float) – The allowed deviation from the target. Defaults to 0.1.

  • full_runs (int) – Number of times to generate new questions. Defaults to 5.

  • each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.

  • retry_on_error (bool) – If True, retries LLM calls on failure.
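One reason to prefer the median over the mean is robustness to a single outlier run. The sketch below compares both statistics on made-up scores using only the standard library; it illustrates the arithmetic and does not call the package.

```python
from statistics import mean, median

# Hypothetical F1 scores from five runs, one of which is an outlier.
run_scores = [0.90, 0.88, 0.92, 0.91, 0.10]
target, max_delta = 0.9, 0.1

# The outlier drags the mean down to 0.742, outside 0.9 ± 0.1.
mean_ok = abs(mean(run_scores) - target) <= max_delta

# The median stays at 0.90, so the same target and tolerance are met.
median_ok = abs(median(run_scores) - target) <= max_delta
```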

pytest_texts_score.texts_agg_f1_min(expected: str, given: str, lower_bound: float, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) -> None

Assert that the minimum aggregated F1 score is at least a lower bound.

Performs multiple evaluation runs, calculates the minimum F1 score across all runs, and asserts that this minimum score is greater than or equal to lower_bound.

Parameters:
  • expected (str) – The reference text.

  • given (str) – The text to evaluate.

  • lower_bound (float) – The minimum acceptable score for the aggregated minimum.

  • full_runs (int) – Number of times to generate new questions. Defaults to 5.

  • each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.

  • retry_on_error (bool) – If True, retries LLM calls on failure.
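The minimum check is the strictest of the aggregated assertions, since every run has to clear the bound. A minimal sketch on precomputed scores, where `assert_min_at_least` is a hypothetical helper and the scores are illustrative:

```python
def assert_min_at_least(scores: list[float], lower_bound: float) -> None:
    """Fail if the lowest score across runs is below lower_bound."""
    lowest = min(scores)
    assert lowest >= lower_bound, (
        f"min score {lowest} is below {lower_bound} "
        f"(run index {scores.index(lowest)})"
    )


# Hypothetical F1 scores from five runs comparing near-identical texts.
run_scores = [0.92, 0.88, 0.95, 0.81, 0.90]
assert_min_at_least(run_scores, lower_bound=0.8)  # passes: the minimum is 0.81
```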

pytest_texts_score.texts_agg_precision_average(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) -> None

Alias for texts_agg_precision_mean.

pytest_texts_score.texts_agg_precision_max(expected: str, given: str, upper_bound: float, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) -> None

Assert that the maximum aggregated precision does not exceed an upper bound.

Performs multiple evaluation runs, calculates the maximum precision score across all runs, and asserts that this maximum score is less than or equal to upper_bound.

Parameters:
  • expected (str) – The reference text.

  • given (str) – The text to evaluate.

  • upper_bound (float) – The maximum acceptable score for the aggregated maximum.

  • full_runs (int) – Number of times to generate new questions. Defaults to 5.

  • each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.

  • retry_on_error (bool) – If True, retries LLM calls on failure.

pytest_texts_score.texts_agg_precision_mean(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) -> None

Assert that the mean aggregated precision is close to a target value.

Performs multiple evaluation runs, calculates the mean (average) precision score, and asserts that it falls within the range target ± max_delta.

Parameters:
  • expected (str) – The reference text.

  • given (str) – The text to evaluate.

  • target (float) – The expected mean score.

  • max_delta (float) – The allowed deviation from the target. Defaults to 0.1.

  • full_runs (int) – Number of times to generate new questions. Defaults to 5.

  • each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.

  • retry_on_error (bool) – If True, retries LLM calls on failure.

pytest_texts_score.texts_agg_precision_median(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) -> None

Assert that the median aggregated precision is close to a target value.

Performs multiple evaluation runs, calculates the median precision score, and asserts that it falls within the range defined by target ± max_delta.

Parameters:
  • expected (str) – The reference text.

  • given (str) – The text to evaluate.

  • target (float) – The expected median score.

  • max_delta (float) – The allowed deviation from the target. Defaults to 0.1.

  • full_runs (int) – Number of times to generate new questions. Defaults to 5.

  • each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.

  • retry_on_error (bool) – If True, retries LLM calls on failure.

pytest_texts_score.texts_agg_precision_min(expected: str, given: str, lower_bound: float, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) -> None

Assert that the minimum aggregated precision is at least a lower bound.

Performs multiple evaluation runs, calculates the minimum precision score across all runs, and asserts that this minimum score is greater than or equal to lower_bound.

Parameters:
  • expected (str) – The reference text.

  • given (str) – The text to evaluate.

  • lower_bound (float) – The minimum acceptable score for the aggregated minimum.

  • full_runs (int) – Number of times to generate new questions. Defaults to 5.

  • each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.

  • retry_on_error (bool) – If True, retries LLM calls on failure.

pytest_texts_score.texts_agg_recall_average(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) -> None

Alias for texts_agg_recall_mean.

pytest_texts_score.texts_agg_recall_max(expected: str, given: str, upper_bound: float, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) -> None

Assert that the maximum aggregated recall does not exceed an upper bound.

Performs multiple evaluation runs, calculates the maximum recall score across all runs, and asserts that this maximum score is less than or equal to upper_bound.

Parameters:
  • expected (str) – The reference text.

  • given (str) – The text to evaluate.

  • upper_bound (float) – The maximum acceptable score for the aggregated maximum.

  • full_runs (int) – Number of times to generate new questions. Defaults to 5.

  • each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.

  • retry_on_error (bool) – If True, retries LLM calls on failure.

pytest_texts_score.texts_agg_recall_mean(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) -> None

Assert that the mean aggregated recall is close to a target value.

Performs multiple evaluation runs, calculates the mean (average) recall score, and asserts that it falls within the range target ± max_delta.

Parameters:
  • expected (str) – The reference text.

  • given (str) – The text to evaluate.

  • target (float) – The expected mean score.

  • max_delta (float) – The allowed deviation from the target. Defaults to 0.1.

  • full_runs (int) – Number of times to generate new questions. Defaults to 5.

  • each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.

  • retry_on_error (bool) – If True, retries LLM calls on failure.

pytest_texts_score.texts_agg_recall_median(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) -> None

Assert that the median aggregated recall is close to a target value.

Performs multiple evaluation runs, calculates the median recall score, and asserts that it falls within the range defined by target ± max_delta.

Parameters:
  • expected (str) – The reference text.

  • given (str) – The text to evaluate.

  • target (float) – The expected median score.

  • max_delta (float) – The allowed deviation from the target. Defaults to 0.1.

  • full_runs (int) – Number of times to generate new questions. Defaults to 5.

  • each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.

  • retry_on_error (bool) – If True, retries LLM calls on failure.

pytest_texts_score.texts_agg_recall_min(expected: str, given: str, lower_bound: float, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) -> None

Assert that the minimum aggregated recall is at least a lower bound.

Performs multiple evaluation runs, calculates the minimum recall score across all runs, and asserts that this minimum score is greater than or equal to lower_bound.

Parameters:
  • expected (str) – The reference text.

  • given (str) – The text to evaluate.

  • lower_bound (float) – The minimum acceptable score for the aggregated minimum.

  • full_runs (int) – Number of times to generate new questions. Defaults to 5.

  • each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.

  • retry_on_error (bool) – If True, retries LLM calls on failure.

pytest_texts_score.texts_expect_completeness_equal(expected: str, given: str, target: float = 1.0, max_delta: float = 0.2, skip_warnings: bool = False, retry_on_error: bool = True) -> None

Alias for texts_expect_precision_equal.

pytest_texts_score.texts_expect_completeness_range(expected: str, given: str, min_score: float, max_score: float, skip_warnings: bool = False, retry_on_error: bool = True) -> None

Alias for texts_expect_precision_range.

pytest_texts_score.texts_expect_correctness_equal(expected: str, given: str, target: float = 1.0, max_delta: float = 0.2, skip_warnings: bool = False, retry_on_error: bool = True) -> None

Alias for texts_expect_recall_equal.

pytest_texts_score.texts_expect_correctness_range(expected: str, given: str, min_score: float, max_score: float, skip_warnings: bool = False, retry_on_error: bool = True) -> None

Alias for texts_expect_recall_range.

pytest_texts_score.texts_expect_f1_equal(expected: str, given: str, target: float = 1.0, max_delta: float = 0.2, skip_warnings: bool = False, retry_on_error: bool = True) -> None

Assert that the F1 score is close to a target value.

This is a convenience wrapper around texts_expect_f1_range(). It performs a single F1 score evaluation and asserts that the result is within target ± max_delta.

Parameters:
  • expected (str) – The reference text.

  • given (str) – The text to evaluate.

  • target (float) – The expected F1 score. Defaults to 1.0.

  • max_delta (float) – The allowed deviation from the target. Defaults to 0.2.

  • skip_warnings (bool) – If True, suppresses input validation warnings.

  • retry_on_error (bool) – If True, retries LLM calls on failure.
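The relationship between the "equal" and "range" forms can be sketched as a thin wrapper that converts target ± max_delta into explicit bounds. Both helpers below are hypothetical stand-ins that take an already computed score; inclusive bounds are an assumption, as is the absence of any clamping to the 0 to 1 interval.

```python
def expect_range(score: float, min_score: float, max_score: float) -> None:
    """Fail unless min_score <= score <= max_score."""
    assert min_score <= score <= max_score, (
        f"score {score} is outside [{min_score}, {max_score}]"
    )


def expect_equal(score: float, target: float = 1.0, max_delta: float = 0.2) -> None:
    """Convenience wrapper: delegate to expect_range with target ± max_delta."""
    expect_range(score, target - max_delta, target + max_delta)


# With the defaults, any score from 0.8 upwards is accepted.
expect_equal(0.85)
```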

pytest_texts_score.texts_expect_f1_range(expected: str, given: str, min_score: float, max_score: float, skip_warnings: bool = False, retry_on_error: bool = True) -> None

Assert that the F1 score falls within a specified range.

This function performs a single evaluation of the F1 score between the expected and given texts. It then asserts that the resulting score is between min_score and max_score.

Parameters:
  • expected (str) – The reference text.

  • given (str) – The text to evaluate.

  • min_score (float) – The minimum acceptable F1 score.

  • max_score (float) – The maximum acceptable F1 score.

  • skip_warnings (bool) – If True, suppresses input validation warnings.

  • retry_on_error (bool) – If True, retries LLM calls on failure.

pytest_texts_score.texts_expect_precision_equal(expected: str, given: str, target: float = 1.0, max_delta: float = 0.2, skip_warnings: bool = False, retry_on_error: bool = True) -> None

Assert that the precision score is close to a target value.

This is a convenience wrapper around texts_expect_precision_range(). It performs a single precision score evaluation and asserts that the result is within target ± max_delta.

Parameters:
  • expected (str) – The reference text.

  • given (str) – The text to evaluate.

  • target (float) – The expected precision score. Defaults to 1.0.

  • max_delta (float) – The allowed deviation from the target. Defaults to 0.2.

  • skip_warnings (bool) – If True, suppresses input validation warnings.

  • retry_on_error (bool) – If True, retries LLM calls on failure.

pytest_texts_score.texts_expect_precision_range(expected: str, given: str, min_score: float, max_score: float, skip_warnings: bool = False, retry_on_error: bool = True) -> None

Assert that the precision score falls within a specified range.

This function performs a single evaluation of the precision score between the expected and given texts. It then asserts that the resulting score is between min_score and max_score.

Parameters:
  • expected (str) – The reference text.

  • given (str) – The text to evaluate.

  • min_score (float) – The minimum acceptable precision score.

  • max_score (float) – The maximum acceptable precision score.

  • skip_warnings (bool) – If True, suppresses input validation warnings.

  • retry_on_error (bool) – If True, retries LLM calls on failure.

pytest_texts_score.texts_expect_recall_equal(expected: str, given: str, target: float = 1.0, max_delta: float = 0.2, skip_warnings: bool = False, retry_on_error: bool = True) -> None

Assert that the recall score is close to a target value.

This is a convenience wrapper around texts_expect_recall_range(). It performs a single recall score evaluation and asserts that the result is within target ± max_delta.

Parameters:
  • expected (str) – The reference text.

  • given (str) – The text to evaluate.

  • target (float) – The expected recall score. Defaults to 1.0.

  • max_delta (float) – The allowed deviation from the target. Defaults to 0.2.

  • skip_warnings (bool) – If True, suppresses input validation warnings.

  • retry_on_error (bool) – If True, retries LLM calls on failure.

pytest_texts_score.texts_expect_recall_range(expected: str, given: str, min_score: float, max_score: float, skip_warnings: bool = False, retry_on_error: bool = True) -> None

Assert that the recall score falls within a specified range.

This function performs a single evaluation of the recall score between the expected and given texts. It then asserts that the resulting score is between min_score and max_score.

Parameters:
  • expected (str) – The reference text.

  • given (str) – The text to evaluate.

  • min_score (float) – The minimum acceptable recall score.

  • max_score (float) – The maximum acceptable recall score.

  • skip_warnings (bool) – If True, suppresses input validation warnings.

  • retry_on_error (bool) – If True, retries LLM calls on failure.
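The retry_on_error flag shared by all of the functions above can be pictured as a small retry loop around each LLM call. This is a generic sketch, not the package's implementation: `call_with_retry`, the `max_attempts` value of 3, and the `flaky_llm_call` stub are all hypothetical.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retry(fn: Callable[[], T], retry_on_error: bool = True,
                    max_attempts: int = 3) -> T:
    """Call fn, retrying on any exception when retry_on_error is True."""
    attempts = max(1, max_attempts) if retry_on_error else 1
    last_error: Optional[Exception] = None
    for _ in range(attempts):
        try:
            return fn()
        except Exception as exc:  # remember the failure and try again
            last_error = exc
    assert last_error is not None
    raise last_error


calls = {"n": 0}


def flaky_llm_call() -> float:
    """Stub that fails twice before returning a score."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return 0.9


result = call_with_retry(flaky_llm_call)  # succeeds on the third attempt
```

With retry_on_error=False the same stub would surface its first RuntimeError immediately.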

Submodules

pytest_texts_score.api module

pytest_texts_score.api.MINIMAL_EXPECTED_MAX_DELTA = 0.05

A recommended minimum value for the max_delta or range width. Used to warn users if their test’s acceptance criteria are very strict, which might lead to flaky tests due to LLM non-determinism.
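A strictness warning of this kind could look like the sketch below. Only the constant's value comes from the documentation; `warn_if_too_strict` is a hypothetical helper, the strict `<` comparison is an assumption, and so is the idea that skip_warnings silences this particular warning.

```python
import warnings

MINIMAL_EXPECTED_MAX_DELTA = 0.05  # value as documented above


def warn_if_too_strict(max_delta: float, skip_warnings: bool = False) -> None:
    """Emit a UserWarning when the tolerance is tighter than recommended."""
    if not skip_warnings and max_delta < MINIMAL_EXPECTED_MAX_DELTA:
        warnings.warn(
            f"max_delta={max_delta} is below {MINIMAL_EXPECTED_MAX_DELTA}; "
            "the test may be flaky because LLM output is non-deterministic",
            UserWarning,
            stacklevel=2,
        )


# Record warnings so the effect of each call can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_if_too_strict(0.01)                      # warns: tolerance too tight
    warn_if_too_strict(0.01, skip_warnings=True)  # silenced by the flag
    warn_if_too_strict(0.2)                       # no warning: tolerance is wide enough
```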

pytest_texts_score.api.texts_agg_f1_max(expected: str, given: str, upper_bound: float, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) -> None

Assert that the maximum aggregated F1 score does not exceed an upper bound.

Performs multiple evaluation runs, calculates the maximum F1 score across all runs, and asserts that this maximum score is less than or equal to upper_bound.

Parameters:
  • expected (str) – The reference text.

  • given (str) – The text to evaluate.

  • upper_bound (float) – The maximum acceptable score for the aggregated maximum.

  • full_runs (int) – Number of times to generate new questions. Defaults to 5.

  • each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.

  • retry_on_error (bool) – If True, retries LLM calls on failure.
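One plausible reading of full_runs and each_question_runs is a nested loop: a new question set per full run, scored several times each. The sketch below follows that reading with injected stand-in callables. `collect_scores`, `generate_questions`, and `score_answers` are hypothetical names, and whether the package aggregates over all full_runs × each_question_runs scores in this way is an assumption.

```python
from typing import Callable, List


def collect_scores(generate_questions: Callable[[], List[str]],
                   score_answers: Callable[[List[str]], float],
                   full_runs: int = 5,
                   each_question_runs: int = 1) -> List[float]:
    """Gather one score per (full run, question run) pair."""
    scores: List[float] = []
    for _ in range(full_runs):
        questions = generate_questions()  # new question set for each full run
        for _ in range(each_question_runs):
            scores.append(score_answers(questions))  # re-evaluate the same set
    return scores


# Stand-ins for the LLM-backed steps: fixed questions and a constant score.
scores = collect_scores(lambda: ["q1", "q2"], lambda qs: 0.5,
                        full_runs=2, each_question_runs=3)
```

Under this reading, the example yields 2 × 3 = 6 scores, and the aggregate (maximum, minimum, mean, or median) would be computed over that list.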

pytest_texts_score.api.texts_agg_f1_mean(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) -> None

Assert that the mean aggregated F1 score is close to a target value.

Performs multiple evaluation runs, calculates the mean (average) F1 score, and asserts that it falls within the range defined by target ± max_delta.

Parameters:
  • expected (str) – The reference text.

  • given (str) – The text to evaluate.

  • target (float) – The expected mean score.

  • max_delta (float) – The allowed deviation from the target. Defaults to 0.1.

  • full_runs (int) – Number of times to generate new questions. Defaults to 5.

  • each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.

  • retry_on_error (bool) – If True, retries LLM calls on failure.

pytest_texts_score.api.texts_agg_f1_median(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) -> None

Assert that the median aggregated F1 score is close to a target value.

Performs multiple evaluation runs, calculates the median F1 score, and asserts that it falls within the range defined by target ± max_delta.

Parameters:
  • expected (str) – The reference text.

  • given (str) – The text to evaluate.

  • target (float) – The expected median score.

  • max_delta (float) – The allowed deviation from the target. Defaults to 0.1.

  • full_runs (int) – Number of times to generate new questions. Defaults to 5.

  • each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.

  • retry_on_error (bool) – If True, retries LLM calls on failure.

pytest_texts_score.api.texts_agg_f1_min(expected: str, given: str, lower_bound: float, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) -> None

Assert that the minimum aggregated F1 score is at least a lower bound.

Performs multiple evaluation runs, calculates the minimum F1 score across all runs, and asserts that this minimum score is greater than or equal to lower_bound.

Parameters:
  • expected (str) – The reference text.

  • given (str) – The text to evaluate.

  • lower_bound (float) – The minimum acceptable score for the aggregated minimum.

  • full_runs (int) – Number of times to generate new questions. Defaults to 5.

  • each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.

  • retry_on_error (bool) – If True, retries LLM calls on failure.

pytest_texts_score.api.texts_agg_precision_max(expected: str, given: str, upper_bound: float, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) -> None

Assert that the maximum aggregated precision does not exceed an upper bound.

Performs multiple evaluation runs, calculates the maximum precision score across all runs, and asserts that this maximum score is less than or equal to upper_bound.

Parameters:
  • expected (str) – The reference text.

  • given (str) – The text to evaluate.

  • upper_bound (float) – The maximum acceptable score for the aggregated maximum.

  • full_runs (int) – Number of times to generate new questions. Defaults to 5.

  • each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.

  • retry_on_error (bool) – If True, retries LLM calls on failure.

pytest_texts_score.api.texts_agg_precision_mean(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) -> None

Assert that the mean aggregated precision is close to a target value.

Performs multiple evaluation runs, calculates the mean (average) precision score, and asserts that it falls within the range target ± max_delta.

Parameters:
  • expected (str) – The reference text.

  • given (str) – The text to evaluate.

  • target (float) – The expected mean score.

  • max_delta (float) – The allowed deviation from the target. Defaults to 0.1.

  • full_runs (int) – Number of times to generate new questions. Defaults to 5.

  • each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.

  • retry_on_error (bool) – If True, retries LLM calls on failure.

pytest_texts_score.api.texts_agg_precision_median(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) -> None

Assert that the median aggregated precision is close to a target value.

Performs multiple evaluation runs, calculates the median precision score, and asserts that it falls within the range defined by target ± max_delta.

Parameters:
  • expected (str) – The reference text.

  • given (str) – The text to evaluate.

  • target (float) – The expected median score.

  • max_delta (float) – The allowed deviation from the target. Defaults to 0.1.

  • full_runs (int) – Number of times to generate new questions. Defaults to 5.

  • each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.

  • retry_on_error (bool) – If True, retries LLM calls on failure.

pytest_texts_score.api.texts_agg_precision_min(expected: str, given: str, lower_bound: float, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) -> None

Assert that the minimum aggregated precision is at least a lower bound.

Performs multiple evaluation runs, calculates the minimum precision score across all runs, and asserts that this minimum score is greater than or equal to lower_bound.

Parameters:
  • expected (str) – The reference text.

  • given (str) – The text to evaluate.

  • lower_bound (float) – The minimum acceptable score for the aggregated minimum.

  • full_runs (int) – Number of times to generate new questions. Defaults to 5.

  • each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.

  • retry_on_error (bool) – If True, retries LLM calls on failure.

pytest_texts_score.api.texts_agg_recall_max(expected: str, given: str, upper_bound: float, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) -> None

Assert that the maximum aggregated recall does not exceed an upper bound.

Performs multiple evaluation runs, calculates the maximum recall score across all runs, and asserts that this maximum score is less than or equal to upper_bound.

Parameters:
  • expected (str) – The reference text.

  • given (str) – The text to evaluate.

  • upper_bound (float) – The maximum acceptable score for the aggregated maximum.

  • full_runs (int) – Number of times to generate new questions. Defaults to 5.

  • each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.

  • retry_on_error (bool) – If True, retries LLM calls on failure.

pytest_texts_score.api.texts_agg_recall_mean(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) -> None

Assert that the mean aggregated recall is close to a target value.

Performs multiple evaluation runs, calculates the mean (average) recall score, and asserts that it falls within the range target ± max_delta.

Parameters:
  • expected (str) – The reference text.

  • given (str) – The text to evaluate.

  • target (float) – The expected mean score.

  • max_delta (float) – The allowed deviation from the target. Defaults to 0.1.

  • full_runs (int) – Number of times to generate new questions. Defaults to 5.

  • each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.

  • retry_on_error (bool) – If True, retries LLM calls on failure.

pytest_texts_score.api.texts_agg_recall_median(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) -> None

Assert that the median aggregated recall is close to a target value.

Performs multiple evaluation runs, calculates the median recall score, and asserts that it falls within the range defined by target ± max_delta.

Parameters:
  • expected (str) – The reference text.

  • given (str) – The text to evaluate.

  • target (float) – The expected median score.

  • max_delta (float) – The allowed deviation from the target. Defaults to 0.1.

  • full_runs (int) – Number of times to generate new questions. Defaults to 5.

  • each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.

  • retry_on_error (bool) – If True, retries LLM calls on failure.

pytest_texts_score.api.texts_agg_recall_min(expected: str, given: str, lower_bound: float, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) -> None

Assert that the minimum aggregated recall is at least a lower bound.

Performs multiple evaluation runs, calculates the minimum recall score across all runs, and asserts that this minimum score is greater than or equal to lower_bound.

Parameters:
  • expected (str) – The reference text.

  • given (str) – The text to evaluate.

  • lower_bound (float) – The minimum acceptable score for the aggregated minimum.

  • full_runs (int) – Number of times to generate new questions. Defaults to 5.

  • each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.

  • retry_on_error (bool) – If True, retries LLM calls on failure.

pytest_texts_score.api.texts_expect_f1_equal(expected: str, given: str, target: float = 1.0, max_delta: float = 0.2, skip_warnings: bool = False, retry_on_error: bool = True) -> None

Assert that the F1 score is close to a target value.

This is a convenience wrapper around texts_expect_f1_range(). It performs a single F1 score evaluation and asserts that the result is within target ± max_delta.

Parameters:
  • expected (str) – The reference text.

  • given (str) – The text to evaluate.

  • target (float) – The expected F1 score. Defaults to 1.0.

  • max_delta (float) – The allowed deviation from the target. Defaults to 0.2.

  • skip_warnings (bool) – If True, suppresses input validation warnings.

  • retry_on_error (bool) – If True, retries LLM calls on failure.

pytest_texts_score.api.texts_expect_f1_range(expected: str, given: str, min_score: float, max_score: float, skip_warnings: bool = False, retry_on_error: bool = True) -> None

Assert that the F1 score falls within a specified range.

This function performs a single evaluation of the F1 score between the expected and given texts. It then asserts that the resulting score is between min_score and max_score.

Parameters:
  • expected (str) – The reference text.

  • given (str) – The text to evaluate.

  • min_score (float) – The minimum acceptable F1 score.

  • max_score (float) – The maximum acceptable F1 score.

  • skip_warnings (bool) – If True, suppresses input validation warnings.

  • retry_on_error (bool) – If True, retries LLM calls on failure.

pytest_texts_score.api.texts_expect_precision_equal(expected: str, given: str, target: float = 1.0, max_delta: float = 0.2, skip_warnings: bool = False, retry_on_error: bool = True) -> None

Assert that the precision score is close to a target value.

This is a convenience wrapper around texts_expect_precision_range(). It performs a single precision score evaluation and asserts that the result is within target ± max_delta.

Parameters:
  • expected (str) – The reference text.

  • given (str) – The text to evaluate.

  • target (float) – The expected precision score. Defaults to 1.0.

  • max_delta (float) – The allowed deviation from the target. Defaults to 0.2.

  • skip_warnings (bool) – If True, suppresses input validation warnings.

  • retry_on_error (bool) – If True, retries LLM calls on failure.

pytest_texts_score.api.texts_expect_precision_range(expected: str, given: str, min_score: float, max_score: float, skip_warnings: bool = False, retry_on_error: bool = True) None[source]

Assert that the precision score falls within a specified range.

This function performs a single evaluation of the precision score between the expected and given texts. It then asserts that the resulting score is between min_score and max_score.

Parameters:
  • expected (str) – The reference text.

  • given (str) – The text to evaluate.

  • min_score (float) – The minimum acceptable precision score.

  • max_score (float) – The maximum acceptable precision score.

  • skip_warnings (bool) – If True, suppresses input validation warnings.

  • retry_on_error (bool) – If True, retries LLM calls on failure.

pytest_texts_score.api.texts_expect_recall_equal(expected: str, given: str, target: float = 1.0, max_delta: float = 0.2, skip_warnings: bool = False, retry_on_error: bool = True) None[source]

Assert that the recall score is close to a target value.

This is a convenience wrapper around texts_expect_recall_range(). It performs a single recall score evaluation and asserts that the result is within target ± max_delta.

Parameters:
  • expected (str) – The reference text.

  • given (str) – The text to evaluate.

  • target (float) – The expected recall score. Defaults to 1.0.

  • max_delta (float) – The allowed deviation from the target. Defaults to 0.2.

  • skip_warnings (bool) – If True, suppresses input validation warnings.

  • retry_on_error (bool) – If True, retries LLM calls on failure.

pytest_texts_score.api.texts_expect_recall_range(expected: str, given: str, min_score: float, max_score: float, skip_warnings: bool = False, retry_on_error: bool = True) None[source]

Assert that the recall score falls within a specified range.

This function performs a single evaluation of the recall score between the expected and given texts. It then asserts that the resulting score is between min_score and max_score.

Parameters:
  • expected (str) – The reference text.

  • given (str) – The text to evaluate.

  • min_score (float) – The minimum acceptable recall score.

  • max_score (float) – The maximum acceptable recall score.

  • skip_warnings (bool) – If True, suppresses input validation warnings.

  • retry_on_error (bool) – If True, retries LLM calls on failure.

pytest_texts_score.api_wrappers module

This module provides wrapper functions for the public API, offering alternative names for existing functionality. For example, functions using ‘mean’ are aliased with ‘average’.

pytest_texts_score.api_wrappers.texts_agg_completeness_average(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) None[source]

Alias for texts_agg_precision_average.

pytest_texts_score.api_wrappers.texts_agg_completeness_max(expected: str, given: str, upper_bound: float, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) None[source]

Alias for texts_agg_precision_max.

pytest_texts_score.api_wrappers.texts_agg_completeness_mean(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) None[source]

Alias for texts_agg_precision_mean.

pytest_texts_score.api_wrappers.texts_agg_completeness_median(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) None[source]

Alias for texts_agg_precision_median.

pytest_texts_score.api_wrappers.texts_agg_completeness_min(expected: str, given: str, lower_bound: float, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) None[source]

Alias for texts_agg_precision_min.

pytest_texts_score.api_wrappers.texts_agg_correctness_average(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) None[source]

Alias for texts_agg_recall_average.

pytest_texts_score.api_wrappers.texts_agg_correctness_max(expected: str, given: str, upper_bound: float, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) None[source]

Alias for texts_agg_recall_max.

pytest_texts_score.api_wrappers.texts_agg_correctness_mean(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) None[source]

Alias for texts_agg_recall_mean.

pytest_texts_score.api_wrappers.texts_agg_correctness_median(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) None[source]

Alias for texts_agg_recall_median.

pytest_texts_score.api_wrappers.texts_agg_correctness_min(expected: str, given: str, lower_bound: float, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) None[source]

Alias for texts_agg_recall_min.

pytest_texts_score.api_wrappers.texts_agg_f1_average(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) None[source]

Alias for texts_agg_f1_mean.

pytest_texts_score.api_wrappers.texts_agg_precision_average(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) None[source]

Alias for texts_agg_precision_mean.

pytest_texts_score.api_wrappers.texts_agg_recall_average(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) None[source]

Alias for texts_agg_recall_mean.

pytest_texts_score.api_wrappers.texts_expect_completeness_equal(expected: str, given: str, target: float = 1.0, max_delta: float = 0.2, skip_warnings: bool = False, retry_on_error: bool = True) None[source]

Alias for texts_expect_precision_equal.

pytest_texts_score.api_wrappers.texts_expect_completeness_range(expected: str, given: str, min_score: float, max_score: float, skip_warnings: bool = False, retry_on_error: bool = True) None[source]

Alias for texts_expect_precision_range.

pytest_texts_score.api_wrappers.texts_expect_correctness_equal(expected: str, given: str, target: float = 1.0, max_delta: float = 0.2, skip_warnings: bool = False, retry_on_error: bool = True) None[source]

Alias for texts_expect_recall_equal.

pytest_texts_score.api_wrappers.texts_expect_correctness_range(expected: str, given: str, min_score: float, max_score: float, skip_warnings: bool = False, retry_on_error: bool = True) None[source]

Alias for texts_expect_recall_range.

pytest_texts_score.client module

pytest_texts_score.client.get_client() AzureOpenAI[source]

Return the initialized AzureOpenAI client.

Retrieves the globally stored AzureOpenAI client instance. It is designed to be called after init_client() has been executed, typically within a pytest fixture.

Returns:

The initialized AzureOpenAI client instance.

Return type:

AzureOpenAI

Raises:

RuntimeError – If the client has not been initialized by calling init_client() first.

pytest_texts_score.client.init_client(config: Config) AzureOpenAI[source]

Initialize and store the global AzureOpenAI client.

This function uses the provided pytest configuration object to instantiate the AzureOpenAI client. The created client instance is stored in a global variable for later retrieval via get_client().

Parameters:

config (pytest.Config) – The pytest config object containing LLM settings.

Returns:

The newly created AzureOpenAI client instance.

Return type:

AzureOpenAI

pytest_texts_score.communication module

pytest_texts_score.communication.evaluate_questions(answer_text: str, questions_text: str) list[dict[str, Any]][source]

Evaluate how well a text answers a list of questions using the LLM.

This function sends answer_text and questions_text (a JSON string of questions) to the configured Azure OpenAI model. The model is prompted to answer each question based on the text and to provide a numeric score. The function parses the JSON response and returns the list of answers. If the response is wrapped in markdown ```json tags, the function handles this and issues a warning.

Parameters:
  • answer_text (str) – The text to use for answering the questions.

  • questions_text (str) – A JSON string representing the list of questions.

Returns:

A list of dictionaries, where each dictionary contains a ‘question’ and its corresponding ‘answer’ score.

Return type:

list[dict[str, Any]]

Raises:
  • ValueError – If the LLM response is not valid JSON or cannot be parsed.

  • openai.APIError – If the API call to the LLM fails.
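The response handling described above can be approximated as follows. This is a minimal sketch, assuming that a fenced response is unwrapped before parsing and that a parse failure is re-raised as ValueError; parse_llm_json is a hypothetical name and the package's actual parsing code may differ.

```python
import json
import warnings

FENCE = "`" * 3  # the three-backtick markdown fence marker


def parse_llm_json(raw: str) -> list:
    """Parse a JSON list from an LLM reply, tolerating a markdown json fence."""
    text = raw.strip()
    if text.startswith(FENCE):
        warnings.warn("LLM response was wrapped in a markdown fence; stripping it.")
        text = text[len(FENCE):]
        if text.startswith("json"):
            text = text[len("json"):]
        if text.endswith(FENCE):
            text = text[:-len(FENCE)]
    try:
        return json.loads(text)
    except json.JSONDecodeError as error:
        raise ValueError(f"LLM response is not valid JSON: {error}") from error


wrapped = FENCE + 'json\n[{"question": "Is the sky blue?", "answer": 1}]\n' + FENCE
print(parse_llm_json(wrapped))
```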

pytest_texts_score.communication.make_questions(base_text: str) str[source]

Generate questions from a given text using the LLM.

This function sends the base_text to the configured Azure OpenAI model with a system prompt designed to elicit factual yes/no questions. It retrieves the global configuration and client instance to make the API call.

Parameters:

base_text (str) – The text from which to generate questions.

Returns:

A JSON string containing the generated questions. Returns an empty string if the model response content is empty.

Return type:

str

Raises:

openai.APIError – If the API call to the LLM fails.

pytest_texts_score.evaluate_score module

class pytest_texts_score.evaluate_score.AggType(*values)[source]

Bases: str, Enum

Aggregation types for scores. Used by the F1, precision, and recall aggregation functions.

AVERAGE = 'average'
MAXIMUM = 'maximum'
MEAN = 'mean'
MEDIAN = 'median'
MINIMUM = 'minimum'

pytest_texts_score.evaluate_score.MAXIMAL_RETRY_ON_ERROR = 5

The maximum number of times to retry an LLM call upon failure before raising an exception.

class pytest_texts_score.evaluate_score.ScoreType(*values)[source]

Bases: str, Enum

F1 = 'f1'
PRECISION = 'precision'
RECALL = 'recall'

pytest_texts_score.evaluate_score.f1_score(precision: float, recall: float) float[source]

Calculate the F1 score from precision and recall.

Computes the harmonic mean of precision and recall. Returns 0 if both precision and recall are 0 to avoid division by zero.

Parameters:
  • precision (float) – The precision score (between 0.0 and 1.0).

  • recall (float) – The recall score (between 0.0 and 1.0).

Returns:

The F1 score.

Return type:

float
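For reference, the harmonic mean with the documented zero guard can be written as below. This is a sketch of the standard F1 formula, not a copy of the package source; f1_sketch is a hypothetical name.

```python
def f1_sketch(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, with a guard for the 0/0 case."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(f1_sketch(1.0, 1.0))  # 1.0
print(f1_sketch(0.0, 0.0))  # 0.0
```

Because the harmonic mean is pulled toward the smaller input, a high F1 score requires both precision and recall to be high.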

pytest_texts_score.evaluate_score.score_one_side(base_text: str, answer_text: str, retry_on_error: bool = True) float[source]

Calculate a one-sided score by generating questions from one text and answering with another.

This is a fundamental building block for both precision and recall calculations. It generates a set of questions based on base_text and then evaluates how well answer_text can answer them. The final score is the average of the answer scores.

Parameters:
  • base_text (str) – The text to generate questions from.

  • answer_text (str) – The text to answer the questions with.

  • retry_on_error (bool) – Whether to retry LLM calls on failure. Defaults to True.

Returns:

The average score from the evaluation.

Return type:

float

Raises:

Exception – If the operation fails after the maximum number of retries.
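The flow described above (generate questions, answer them, average the scores, retry on failure) can be sketched with the LLM calls replaced by injected callables. The stand-in functions, the 'answer' values, and the exact retry semantics (up to five attempts in total, any exception triggers a retry) are assumptions made for illustration.

```python
import json

MAX_RETRIES = 5  # mirrors the documented MAXIMAL_RETRY_ON_ERROR value


def score_one_side_sketch(base_text, answer_text, make_questions,
                          evaluate_questions, retry_on_error=True):
    """Average the per-question scores, retrying the whole round on failure."""
    attempts = MAX_RETRIES if retry_on_error else 1
    last_error = None
    for _ in range(attempts):
        try:
            questions = make_questions(base_text)
            answers = evaluate_questions(answer_text, questions)
            return sum(item["answer"] for item in answers) / len(answers)
        except Exception as error:  # broad on purpose: this is only a sketch
            last_error = error
    raise Exception(f"Scoring failed after {attempts} attempt(s)") from last_error


def fake_make_questions(text):
    """Stand-in for the LLM question generator."""
    return json.dumps(["Is the sky blue?", "Is grass red?"])


def fake_evaluate(answer_text, questions_text):
    """Stand-in for the LLM answer scorer: 1.0 for one question, 0.0 for the other."""
    return [{"question": q, "answer": 1.0 if "blue" in q else 0.0}
            for q in json.loads(questions_text)]


print(score_one_side_sketch("reference", "candidate",
                            fake_make_questions, fake_evaluate))  # 0.5
```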

pytest_texts_score.evaluate_score.scores_agg(scores: list[float], agg_type: AggType | Literal['minimum', 'maximum', 'median', 'average', 'mean']) float[source]

Aggregate a list of scores using a specified method.

This function takes a list of numeric scores and applies an aggregation function (min, max, median, or mean/average) to produce a single summary score.

Parameters:
  • scores (list[float]) – A list of scores to aggregate.

  • agg_type (AggType | Literal["minimum", "maximum", "median", "average", "mean"]) – The aggregation method to use.

Returns:

The aggregated score.

Return type:

float

Raises:

ValueError – If an unknown aggregation type is provided.
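A comparable aggregation can be built on the standard library. The sketch below accepts the same string literals as the documented signature and treats 'average' and 'mean' as synonyms; aggregate_sketch is a hypothetical name, and handling of AggType enum members is omitted for brevity.

```python
import statistics


def aggregate_sketch(scores, agg_type):
    """Reduce a list of scores to one value using the named method."""
    if agg_type == "minimum":
        return min(scores)
    if agg_type == "maximum":
        return max(scores)
    if agg_type == "median":
        return statistics.median(scores)
    if agg_type in ("average", "mean"):
        return statistics.mean(scores)
    raise ValueError(f"Unknown aggregation type: {agg_type!r}")


scores = [0.2, 0.4, 0.9]
print(aggregate_sketch(scores, "minimum"))  # 0.2
print(aggregate_sketch(scores, "median"))   # 0.4
```

Minimum is the strictest choice for an assertion, since a single poor run fails the test; median is more tolerant of occasional outliers in LLM output.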

pytest_texts_score.evaluate_score.texts_agg_f1(expected: str, given: str, generate_questions: int, generate_answers_per_questions: int, agg_type: AggType | Literal['minimum', 'maximum', 'median', 'average', 'mean'], retry_on_error: bool = True) float[source]

Calculate an aggregated F1 score over multiple runs.

This function first generates multiple F1 scores by calling texts_multiple_f1 and then aggregates these scores using the specified agg_type.

Parameters:
  • expected (str) – The reference text.

  • given (str) – The text to evaluate.

  • generate_questions (int) – The number of times to generate a new set of questions.

  • generate_answers_per_questions (int) – The number of times to evaluate answers for each set of questions.

  • agg_type (AggType | Literal["minimum", "maximum", "median", "average", "mean"]) – The aggregation method to use on the collected scores.

  • retry_on_error (bool) – Whether to retry LLM calls on failure. Defaults to True.

Returns:

The final aggregated F1 score.

Return type:

float

pytest_texts_score.evaluate_score.texts_agg_precision(expected: str, given: str, generate_questions: int, generate_answers_per_questions: int, agg_type: AggType | Literal['minimum', 'maximum', 'median', 'average', 'mean'], retry_on_error: bool = True) float[source]

Calculate an aggregated precision score over multiple runs.

This function first generates multiple precision scores by calling texts_multiple_precision and then aggregates these scores using the specified agg_type.

Parameters:
  • expected (str) – The reference text.

  • given (str) – The text to evaluate.

  • generate_questions (int) – The number of times to generate a new set of questions.

  • generate_answers_per_questions (int) – The number of times to evaluate answers for each set of questions.

  • agg_type (AggType | Literal["minimum", "maximum", "median", "average", "mean"]) – The aggregation method to use on the collected scores.

  • retry_on_error (bool) – Whether to retry LLM calls on failure. Defaults to True.

Returns:

The final aggregated precision score.

Return type:

float

pytest_texts_score.evaluate_score.texts_agg_recall(expected: str, given: str, generate_questions: int, generate_answers_per_questions: int, agg_type: AggType | Literal['minimum', 'maximum', 'median', 'average', 'mean'], retry_on_error: bool = True) float[source]

Calculate an aggregated recall score over multiple runs.

This function first generates multiple recall scores by calling texts_multiple_recall and then aggregates these scores using the specified agg_type.

Parameters:
  • expected (str) – The reference text.

  • given (str) – The text to evaluate.

  • generate_questions (int) – The number of times to generate a new set of questions.

  • generate_answers_per_questions (int) – The number of times to evaluate answers for each set of questions.

  • agg_type (AggType | Literal["minimum", "maximum", "median", "average", "mean"]) – The aggregation method to use on the collected scores.

  • retry_on_error (bool) – Whether to retry LLM calls on failure. Defaults to True.

Returns:

The final aggregated recall score.

Return type:

float

pytest_texts_score.evaluate_score.texts_evaluate_f1(expected: str, given: str, retry_on_error: bool = True) float[source]

Calculate the F1 score between two texts.

This function computes the F1 score by first calculating the precision and recall between the expected and given texts. It serves as a single-run evaluation of the harmonic mean of precision and recall.

Parameters:
  • expected (str) – The reference text.

  • given (str) – The text to be evaluated against the reference.

  • retry_on_error (bool) – Whether to retry the LLM call on failure. Defaults to True.

Returns:

The calculated F1 score.

Return type:

float

pytest_texts_score.evaluate_score.texts_evaluate_precision(expected: str, given: str, retry_on_error: bool = True) float[source]

Evaluate the precision score of the given text against the expected text.

Precision is calculated by generating questions from the given text and checking how well they are answered by the expected text. This measures how much of the information in the given text is also present in the expected text.

Parameters:
  • expected (str) – The reference text used for answering questions.

  • given (str) – The text from which questions are generated.

  • retry_on_error (bool) – Whether to retry the LLM call on failure. Defaults to True.

Returns:

The calculated precision score.

Return type:

float

pytest_texts_score.evaluate_score.texts_evaluate_recall(expected: str, given: str, retry_on_error: bool = True) float[source]

Evaluate the recall score of the given text against the expected text.

Recall is calculated by generating questions from the expected text and checking how well they are answered by the given text. This measures how much of the information in the expected text is covered by the given text.

Parameters:
  • expected (str) – The reference text from which questions are generated.

  • given (str) – The text used for answering questions.

  • retry_on_error (bool) – Whether to retry the LLM call on failure. Defaults to True.

Returns:

The calculated recall score.

Return type:

float
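Precision and recall as described above differ only in which text supplies the questions and which text answers them. The sketch below makes that swap explicit. To stay runnable without an LLM, it substitutes a simple word-overlap measure for the question-and-answer scoring, so the numbers are illustrative only; one_side, precision_sketch, and recall_sketch are hypothetical names.

```python
def one_side(base_text: str, answer_text: str) -> float:
    """Stand-in for score_one_side: share of base_text words found in answer_text."""
    base_words = set(base_text.lower().split())
    answer_words = set(answer_text.lower().split())
    return len(base_words & answer_words) / len(base_words)


def precision_sketch(expected: str, given: str) -> float:
    """Questions come from given; expected answers them."""
    return one_side(base_text=given, answer_text=expected)


def recall_sketch(expected: str, given: str) -> float:
    """Questions come from expected; given answers them."""
    return one_side(base_text=expected, answer_text=given)


expected = "paris is the capital of france"
given = "paris is the capital"
print(precision_sketch(expected, given))  # 1.0: everything in given is in expected
print(recall_sketch(expected, given))     # below 1.0: given omits part of expected
```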

pytest_texts_score.evaluate_score.texts_multiple_f1(expected: str, given: str, generate_questions: int, generate_answers_per_questions: int, score_only: bool = True, retry_on_error: bool = True) list[float] | list[tuple[int, int, float, float, float]][source]

Perform multiple evaluation runs to get a list of F1 scores.

This function runs the F1 score evaluation multiple times to account for variability in LLM responses. It generates new sets of questions for precision and recall in each generate_questions loop, and for each set, it evaluates answers generate_answers_per_questions times.

Parameters:
  • expected (str) – The reference text.

  • given (str) – The text to evaluate.

  • generate_questions (int) – The number of times to generate a new set of questions.

  • generate_answers_per_questions (int) – The number of times to evaluate answers for each set of questions.

  • score_only (bool) – If True, returns only a list of F1 scores. If False, returns a list of tuples with detailed run info. Defaults to True.

  • retry_on_error (bool) – Whether to retry LLM calls on failure. Defaults to True.

Returns:

A list of F1 scores, or a list of tuples (question_run, answer_run, precision, recall, f1_score).

Return type:

list[float] | list[tuple[int, int, float, float, float]]

Raises:

Exception – If the operation fails after the maximum number of retries.

pytest_texts_score.evaluate_score.texts_multiple_precision(expected: str, given: str, generate_questions: int, generate_answers_per_questions: int, score_only: bool = True, retry_on_error: bool = True) list[float] | list[tuple[int, int, float]][source]

Perform multiple evaluation runs to get a list of precision scores.

This function runs the precision score evaluation multiple times. It generates new sets of questions from the given text in each generate_questions loop, and for each set, it evaluates answers generate_answers_per_questions times using the expected text.

Parameters:
  • expected (str) – The reference text for answering.

  • given (str) – The text to generate questions from.

  • generate_questions (int) – The number of times to generate a new set of questions.

  • generate_answers_per_questions (int) – The number of times to evaluate answers for each set of questions.

  • score_only (bool) – If True, returns only a list of precision scores. If False, returns a list of tuples with detailed run info. Defaults to True.

  • retry_on_error (bool) – Whether to retry LLM calls on failure. Defaults to True.

Returns:

A list of precision scores, or a list of tuples (question_run, answer_run, precision).

Return type:

list[float] | list[tuple[int, int, float]]

Raises:

Exception – If the operation fails after the maximum number of retries.
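The two nested loops described above can be sketched as follows, with the LLM calls replaced by injected callables. Zero-based run indices and the helper names are assumptions for illustration; the tuple layout follows the documented (question_run, answer_run, precision) shape.

```python
def multiple_runs_sketch(generate_questions, generate_answers_per_questions,
                         make_questions, score_answers, score_only=True):
    """Generate questions N times and score each question set M times."""
    results = []
    for question_run in range(generate_questions):
        questions = make_questions()  # a new question set per outer iteration
        for answer_run in range(generate_answers_per_questions):
            score = score_answers(questions)  # the same questions, scored again
            if score_only:
                results.append(score)
            else:
                results.append((question_run, answer_run, score))
    return results


# Stand-ins: a fixed question set and a constant score.
runs = multiple_runs_sketch(2, 3, lambda: "[]", lambda questions: 0.8,
                            score_only=False)
print(len(runs))  # 6
print(runs[0])    # (0, 0, 0.8)
```

The total number of scores is generate_questions × generate_answers_per_questions, so cost grows multiplicatively with both parameters.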

pytest_texts_score.evaluate_score.texts_multiple_recall(expected: str, given: str, generate_questions: int, generate_answers_per_questions: int, score_only: bool = True, retry_on_error: bool = True) list[float] | list[tuple[int, int, float]][source]

Perform multiple evaluation runs to get a list of recall scores.

This function runs the recall score evaluation multiple times. It generates new sets of questions from the expected text in each generate_questions loop, and for each set, it evaluates answers generate_answers_per_questions times using the given text.

Parameters:
  • expected (str) – The reference text to generate questions from.

  • given (str) – The text for answering.

  • generate_questions (int) – The number of times to generate a new set of questions.

  • generate_answers_per_questions (int) – The number of times to evaluate answers for each set of questions.

  • score_only (bool) – If True, returns only a list of recall scores. If False, returns a list of tuples with detailed run info. Defaults to True.

  • retry_on_error (bool) – Whether to retry LLM calls on failure. Defaults to True.

Returns:

A list of recall scores, or a list of tuples (question_run, answer_run, recall).

Return type:

list[float] | list[tuple[int, int, float]]

Raises:

Exception – If the operation fails after the maximum number of retries.

pytest_texts_score.plugin module

pytest_texts_score.plugin.get_config() Config[source]

Return the initialized pytest configuration object.

Retrieves the globally stored pytest.Config object, which contains the resolved LLM configuration. This function should be called after pytest_configure has run.

Returns:

The pytest config object.

Return type:

pytest.Config

Raises:

RuntimeError – If the configuration has not been initialized.

pytest_texts_score.plugin.mask_api_key(key: str | None) str | None[source]

Mask an API key for safe display.

Replaces all but the first character of the API key with asterisks (*) to prevent leaking sensitive information in logs or reports.

Parameters:

key (Optional[str]) – The API key to mask.

Returns:

The masked API key, or None if the input was None.

Return type:

Optional[str]
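The masking rule described above amounts to a one-line string operation. This is a sketch of that rule; mask_api_key_sketch is a hypothetical name, and returning an empty string unchanged is an assumption.

```python
from typing import Optional


def mask_api_key_sketch(key: Optional[str]) -> Optional[str]:
    """Keep the first character and replace every other character with '*'."""
    if key is None:
        return None
    return key[:1] + "*" * (len(key) - 1)


print(mask_api_key_sketch("sk-12345"))  # s*******
print(mask_api_key_sketch(None))        # None
```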

pytest_texts_score.plugin.pytest_addoption(parser: Parser) None[source]

Add command-line and .ini options for LLM configuration to pytest.

This hook implementation defines various options to configure the Azure OpenAI client, such as API key, endpoint, model, and other parameters. Options can be provided via the command line or a pytest.ini file.

Parameters:

parser (pytest.Parser) – The pytest option parser.

Returns:

None.

pytest_texts_score.plugin.pytest_configure(config: Config) None[source]

Resolve LLM config (CLI > ini > default) and initialize the client.

This hook is called after command line and configuration files are parsed. It resolves the final configuration values by prioritizing command-line options over .ini file settings, and then over default values. It validates that all required settings are present and then initializes the global LLM client.

Parameters:

config (pytest.Config) – The pytest config object.

Returns:

None.

Raises:

pytest.UsageError – If any required configuration values are missing.
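The CLI > ini > default precedence can be illustrated independently of pytest. In the sketch below, plain dictionaries stand in for the parsed command-line and .ini values, ValueError stands in for pytest.UsageError, and the option names are hypothetical.

```python
def resolve_option(cli_value, ini_value, default=None):
    """Return the first value that is set, in CLI > ini > default order."""
    for candidate in (cli_value, ini_value):
        if candidate not in (None, ""):
            return candidate
    return default


def resolve_all(cli, ini, defaults, required):
    """Resolve every option, then fail if a required one is still unset."""
    names = sorted(set(defaults) | set(required))
    resolved = {name: resolve_option(cli.get(name), ini.get(name),
                                     defaults.get(name))
                for name in names}
    missing = [name for name in required if resolved[name] in (None, "")]
    if missing:
        raise ValueError("Missing required settings: " + ", ".join(missing))
    return resolved


settings = resolve_all(
    cli={"model": "cli-model"},
    ini={"model": "ini-model", "endpoint": "https://example.invalid"},
    defaults={"model": "default-model", "endpoint": None},
    required=["model", "endpoint"],
)
print(settings["model"])     # cli-model
print(settings["endpoint"])  # https://example.invalid
```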

pytest_texts_score.plugin.pytest_report_header(config: Config) str[source]

Add LLM configuration details to the pytest report header.

This hook provides a custom string to be displayed in the header of the test report, showing the resolved LLM configuration parameters for the current test run. The API key is masked for security.

Parameters:

config (pytest.Config) – The pytest config object.

Returns:

A string to be included in the report header.

Return type:

str

pytest_texts_score.plugin.texts_score() dict[str, Callable][source]

Provide access to text comparison helper functions as a fixture.

This fixture returns a dictionary of callable functions for text scoring and evaluation. These functions include various aggregation and expectation helpers for metrics like F1 score, precision, recall, completeness, and correctness.

Returns:

A dictionary mapping function names to callable helper functions.

Return type:

dict[str, Callable]

Note

This fixture returns a dictionary of functions rather than exposing them globally.

pytest_texts_score.plugin.texts_score_client() AzureOpenAI[source]

Provide access to the initialized LLM client as a fixture.

This session-scoped fixture allows tests to get the configured AzureOpenAI client instance.

Returns:

The initialized AzureOpenAI client.

Return type:

AzureOpenAI

pytest_texts_score.prompts module

Prompt templates used by pytest-texts-score. These prompts are carefully engineered to guide the LLM’s behavior for question generation and evaluation. Modifying them may have significant impacts on the scoring results.

pytest_texts_score.prompts.get_system_answers_prompt() str[source]

Get the system prompt for answering questions.

This function returns the predefined system prompt that instructs the LLM on how to answer a list of questions based on a given text, using a numeric scoring system.

Returns:

The question answering prompt string.

Return type:

str

pytest_texts_score.prompts.get_system_questions_prompt() str[source]

Get the system prompt for generating questions.

This function returns the predefined system prompt that instructs the LLM on how to generate factual yes/no questions from a given text.

Returns:

The question generation prompt string.

Return type:

str

pytest_texts_score.prompts.get_user_answers_prompt(answer_text: str, questions_text: str) str[source]

Create a user prompt for answering questions.

This function formats the text and the questions into a single prompt that will be paired with the system answer prompt.

Parameters:
  • answer_text (str) – The text to use for answering the questions.

  • questions_text (str) – The JSON string of questions to be answered.

Returns:

The formatted user prompt string.

Return type:

str

pytest_texts_score.prompts.get_user_questions_prompt(text: str) str[source]

Create a user prompt for question generation.

This function formats the user-provided text into a simple prompt that will be paired with the system question prompt.

Parameters:

text (str) – The text to generate questions from.

Returns:

The formatted user prompt string.

Return type:

str