pytest_texts_score package
Main entry point for the pytest-texts-score public API.
This module exposes the primary functions for text-based scoring and assertions
within pytest. It includes functions for single-run evaluations
(texts_expect_*) and multi-run, aggregated evaluations (texts_agg_*) for
metrics like F1, precision, and recall.
It also provides aliases like “completeness” for precision and “correctness” for recall, which can be more intuitive in certain testing contexts.
- pytest_texts_score.texts_agg_completeness_average(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None
Alias for texts_agg_precision_average.
- pytest_texts_score.texts_agg_completeness_max(expected: str, given: str, upper_bound: float, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None
Alias for texts_agg_precision_max.
- pytest_texts_score.texts_agg_completeness_mean(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None
Alias for texts_agg_precision_mean.
- pytest_texts_score.texts_agg_completeness_median(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None
Alias for texts_agg_precision_median.
- pytest_texts_score.texts_agg_completeness_min(expected: str, given: str, lower_bound: float, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None
Alias for texts_agg_precision_min.
- pytest_texts_score.texts_agg_correctness_average(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None
Alias for texts_agg_recall_average.
- pytest_texts_score.texts_agg_correctness_max(expected: str, given: str, upper_bound: float, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None
Alias for texts_agg_recall_max.
- pytest_texts_score.texts_agg_correctness_mean(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None
Alias for texts_agg_recall_mean.
- pytest_texts_score.texts_agg_correctness_median(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None
Alias for texts_agg_recall_median.
- pytest_texts_score.texts_agg_correctness_min(expected: str, given: str, lower_bound: float, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None
Alias for texts_agg_recall_min.
- pytest_texts_score.texts_agg_f1_average(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None
Alias for texts_agg_f1_mean.
- pytest_texts_score.texts_agg_f1_max(expected: str, given: str, upper_bound: float, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None
Assert that the maximum aggregated F1 score does not exceed an upper bound.
Performs multiple evaluation runs, calculates the maximum F1 score across all runs, and asserts that this maximum score is less than or equal to upper_bound.
Parameters:
expected (str) – The reference text.
given (str) – The text to evaluate.
upper_bound (float) – The maximum acceptable score for the aggregated maximum.
full_runs (int) – Number of times to generate new questions. Defaults to 5.
each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.
retry_on_error (bool) – If True, retries LLM calls on failure.
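The comparison this function performs can be pictured with a small stand-alone sketch. The helper assert_max_at_most and the sample scores below are hypothetical and only mirror the documented rule; the real function obtains its scores from LLM-based evaluation runs.

```python
def assert_max_at_most(scores: list[float], upper_bound: float) -> None:
    """Fail if the highest score across runs exceeds upper_bound (hypothetical helper)."""
    highest = max(scores)
    assert highest <= upper_bound, (
        f"max score {highest:.3f} exceeds upper bound {upper_bound:.3f}"
    )


# Illustrative per-run F1 scores, not real library output.
run_scores = [0.12, 0.30, 0.25, 0.18, 0.22]
assert_max_at_most(run_scores, upper_bound=0.4)
```

A bound of this kind may suit negative tests, where the given text is expected to share little with the reference, so even the best-scoring run should stay low.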
- pytest_texts_score.texts_agg_f1_mean(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None
Assert that the mean aggregated F1 score is close to a target value.
Performs multiple evaluation runs, calculates the mean (average) F1 score, and asserts that it falls within the range defined by target ± max_delta.
Parameters:
expected (str) – The reference text.
given (str) – The text to evaluate.
target (float) – The expected mean score.
max_delta (float) – The allowed deviation from the target. Defaults to 0.1.
full_runs (int) – Number of times to generate new questions. Defaults to 5.
each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.
retry_on_error (bool) – If True, retries LLM calls on failure.
- pytest_texts_score.texts_agg_f1_median(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None
Assert that the median aggregated F1 score is close to a target value.
Performs multiple evaluation runs, calculates the median F1 score, and asserts that it falls within the range defined by target ± max_delta.
Parameters:
expected (str) – The reference text.
given (str) – The text to evaluate.
target (float) – The expected median score.
max_delta (float) – The allowed deviation from the target. Defaults to 0.1.
full_runs (int) – Number of times to generate new questions. Defaults to 5.
each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.
retry_on_error (bool) – If True, retries LLM calls on failure.
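To see why a median check can behave differently from a mean check, the sketch below applies both to the same set of scores. The helper within_delta, the inclusive comparison, and the sample scores are assumptions made for illustration and are not taken from the library.

```python
from statistics import mean, median


def within_delta(value: float, target: float, max_delta: float) -> bool:
    """Return True if value lies inside target ± max_delta (inclusive, by assumption)."""
    return abs(value - target) <= max_delta


# Illustrative per-run scores with one outlier run, not real library output.
run_scores = [0.80, 0.82, 0.79, 0.81, 0.20]

median_ok = within_delta(median(run_scores), target=0.8, max_delta=0.1)
mean_ok = within_delta(mean(run_scores), target=0.8, max_delta=0.1)
print(median_ok, mean_ok)  # True False
```

Here the single low run drags the mean to 0.684, outside the tolerance, while the median stays at 0.80. The median variant may therefore be the steadier choice when occasional outlier runs are expected.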
- pytest_texts_score.texts_agg_f1_min(expected: str, given: str, lower_bound: float, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None
Assert that the minimum aggregated F1 score does not fall below a lower bound.
Performs multiple evaluation runs, calculates the minimum F1 score across all runs, and asserts that this minimum score is greater than or equal to lower_bound.
Parameters:
expected (str) – The reference text.
given (str) – The text to evaluate.
lower_bound (float) – The minimum acceptable score for the aggregated minimum.
full_runs (int) – Number of times to generate new questions. Defaults to 5.
each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.
retry_on_error (bool) – If True, retries LLM calls on failure.
- pytest_texts_score.texts_agg_precision_average(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None
Alias for texts_agg_precision_mean.
- pytest_texts_score.texts_agg_precision_max(expected: str, given: str, upper_bound: float, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None
Assert that the maximum aggregated precision does not exceed an upper bound.
Performs multiple evaluation runs, calculates the maximum precision score across all runs, and asserts that this maximum score is less than or equal to upper_bound.
Parameters:
expected (str) – The reference text.
given (str) – The text to evaluate.
upper_bound (float) – The maximum acceptable score for the aggregated maximum.
full_runs (int) – Number of times to generate new questions. Defaults to 5.
each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.
retry_on_error (bool) – If True, retries LLM calls on failure.
- pytest_texts_score.texts_agg_precision_mean(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None
Assert that the mean aggregated precision is close to a target value.
Performs multiple evaluation runs, calculates the mean (average) precision score, and asserts that it falls within the range defined by target ± max_delta.
Parameters:
expected (str) – The reference text.
given (str) – The text to evaluate.
target (float) – The expected mean score.
max_delta (float) – The allowed deviation from the target. Defaults to 0.1.
full_runs (int) – Number of times to generate new questions. Defaults to 5.
each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.
retry_on_error (bool) – If True, retries LLM calls on failure.
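One plausible reading of how full_runs and each_question_runs interact is a pair of nested loops: an outer loop per freshly generated question set and an inner loop that re-evaluates that set. The sketch below is built on that assumption only. collect_scores and fake_precision are hypothetical stand-ins, and the library may combine the runs differently.

```python
from statistics import mean


def collect_scores(score_once, full_runs: int = 5, each_question_runs: int = 1) -> list[float]:
    """Gather one score per (question set, evaluation attempt) pair (assumed structure)."""
    scores = []
    for run in range(full_runs):  # each full run stands for a new question set
        for attempt in range(each_question_runs):  # repeated evaluation of that set
            scores.append(score_once(run, attempt))
    return scores


def fake_precision(run: int, attempt: int) -> float:
    """Deterministic stand-in for an LLM-based precision evaluation."""
    return 0.70 + 0.02 * run + 0.01 * attempt


scores = collect_scores(fake_precision, full_runs=3, each_question_runs=2)
mean_score = mean(scores)

# Same shape of check as the documented target ± max_delta rule.
assert abs(mean_score - 0.75) <= 0.1
```

Under this reading the number of evaluations grows as full_runs × each_question_runs, which is worth keeping in mind when each evaluation involves LLM calls.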
- pytest_texts_score.texts_agg_precision_median(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None
Assert that the median aggregated precision is close to a target value.
Performs multiple evaluation runs, calculates the median precision score, and asserts that it falls within the range defined by target ± max_delta.
Parameters:
expected (str) – The reference text.
given (str) – The text to evaluate.
target (float) – The expected median score.
max_delta (float) – The allowed deviation from the target. Defaults to 0.1.
full_runs (int) – Number of times to generate new questions. Defaults to 5.
each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.
retry_on_error (bool) – If True, retries LLM calls on failure.
- pytest_texts_score.texts_agg_precision_min(expected: str, given: str, lower_bound: float, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None
Assert that the minimum aggregated precision does not fall below a lower bound.
Performs multiple evaluation runs, calculates the minimum precision score across all runs, and asserts that this minimum score is greater than or equal to lower_bound.
Parameters:
expected (str) – The reference text.
given (str) – The text to evaluate.
lower_bound (float) – The minimum acceptable score for the aggregated minimum.
full_runs (int) – Number of times to generate new questions. Defaults to 5.
each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.
retry_on_error (bool) – If True, retries LLM calls on failure.
- pytest_texts_score.texts_agg_recall_average(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None
Alias for texts_agg_recall_mean.
- pytest_texts_score.texts_agg_recall_max(expected: str, given: str, upper_bound: float, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None
Assert that the maximum aggregated recall does not exceed an upper bound.
Performs multiple evaluation runs, calculates the maximum recall score across all runs, and asserts that this maximum score is less than or equal to upper_bound.
Parameters:
expected (str) – The reference text.
given (str) – The text to evaluate.
upper_bound (float) – The maximum acceptable score for the aggregated maximum.
full_runs (int) – Number of times to generate new questions. Defaults to 5.
each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.
retry_on_error (bool) – If True, retries LLM calls on failure.
- pytest_texts_score.texts_agg_recall_mean(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None
Assert that the mean aggregated recall is close to a target value.
Performs multiple evaluation runs, calculates the mean (average) recall score, and asserts that it falls within the range defined by target ± max_delta.
Parameters:
expected (str) – The reference text.
given (str) – The text to evaluate.
target (float) – The expected mean score.
max_delta (float) – The allowed deviation from the target. Defaults to 0.1.
full_runs (int) – Number of times to generate new questions. Defaults to 5.
each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.
retry_on_error (bool) – If True, retries LLM calls on failure.
- pytest_texts_score.texts_agg_recall_median(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None
Assert that the median aggregated recall is close to a target value.
Performs multiple evaluation runs, calculates the median recall score, and asserts that it falls within the range defined by target ± max_delta.
Parameters:
expected (str) – The reference text.
given (str) – The text to evaluate.
target (float) – The expected median score.
max_delta (float) – The allowed deviation from the target. Defaults to 0.1.
full_runs (int) – Number of times to generate new questions. Defaults to 5.
each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.
retry_on_error (bool) – If True, retries LLM calls on failure.
- pytest_texts_score.texts_agg_recall_min(expected: str, given: str, lower_bound: float, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None
Assert that the minimum aggregated recall does not fall below a lower bound.
Performs multiple evaluation runs, calculates the minimum recall score across all runs, and asserts that this minimum score is greater than or equal to lower_bound.
Parameters:
expected (str) – The reference text.
given (str) – The text to evaluate.
lower_bound (float) – The minimum acceptable score for the aggregated minimum.
full_runs (int) – Number of times to generate new questions. Defaults to 5.
each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.
retry_on_error (bool) – If True, retries LLM calls on failure.
- pytest_texts_score.texts_expect_completeness_equal(expected: str, given: str, target: float = 1.0, max_delta: float = 0.2, skip_warnings: bool = False, retry_on_error: bool = True) → None
Alias for texts_expect_precision_equal.
- pytest_texts_score.texts_expect_completeness_range(expected: str, given: str, min_score: float, max_score: float, skip_warnings: bool = False, retry_on_error: bool = True) → None
Alias for texts_expect_precision_range.
- pytest_texts_score.texts_expect_correctness_equal(expected: str, given: str, target: float = 1.0, max_delta: float = 0.2, skip_warnings: bool = False, retry_on_error: bool = True) → None
Alias for texts_expect_recall_equal.
- pytest_texts_score.texts_expect_correctness_range(expected: str, given: str, min_score: float, max_score: float, skip_warnings: bool = False, retry_on_error: bool = True) → None
Alias for texts_expect_recall_range.
- pytest_texts_score.texts_expect_f1_equal(expected: str, given: str, target: float = 1.0, max_delta: float = 0.2, skip_warnings: bool = False, retry_on_error: bool = True) → None
Assert that the F1 score is close to a target value.
This is a convenience wrapper around texts_expect_f1_range(). It performs a single F1 score evaluation and asserts that the result is within target ± max_delta.
Parameters:
expected (str) – The reference text.
given (str) – The text to evaluate.
target (float) – The expected F1 score. Defaults to 1.0.
max_delta (float) – The allowed deviation from the target. Defaults to 0.2.
skip_warnings (bool) – If True, suppresses input validation warnings.
retry_on_error (bool) – If True, retries LLM calls on failure.
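The relationship between the "equal" and "range" variants can be sketched as follows. The helpers score_in_range and score_close_to are hypothetical and operate on an already-computed score; only the default values of target and max_delta are taken from the signature above, and the inclusive bounds are an assumption.

```python
def score_in_range(score: float, min_score: float, max_score: float) -> bool:
    """Range check on a precomputed score (inclusive bounds are an assumption)."""
    return min_score <= score <= max_score


def score_close_to(score: float, target: float = 1.0, max_delta: float = 0.2) -> bool:
    """'Equal' check expressed as a range check around the target."""
    return score_in_range(score, target - max_delta, target + max_delta)


print(score_close_to(0.85))  # True
print(score_close_to(0.70))  # False
```

Assuming scores lie between 0 and 1, the default target of 1.0 means only the lower edge of the band (0.8) has any practical effect.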
- pytest_texts_score.texts_expect_f1_range(expected: str, given: str, min_score: float, max_score: float, skip_warnings: bool = False, retry_on_error: bool = True) → None
Assert that the F1 score falls within a specified range.
This function performs a single evaluation of the F1 score between the expected and given texts. It then asserts that the resulting score is between min_score and max_score.
Parameters:
expected (str) – The reference text.
given (str) – The text to evaluate.
min_score (float) – The minimum acceptable F1 score.
max_score (float) – The maximum acceptable F1 score.
skip_warnings (bool) – If True, suppresses input validation warnings.
retry_on_error (bool) – If True, retries LLM calls on failure.
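For readers less familiar with the metric, F1 is conventionally the harmonic mean of precision and recall. The sketch below uses that standard definition together with a range assertion. Whether the library computes F1 in exactly this way is an assumption, and f1_score and assert_in_range are hypothetical helpers.

```python
def f1_score(precision: float, recall: float) -> float:
    """Standard harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def assert_in_range(score: float, min_score: float, max_score: float) -> None:
    """Fail unless min_score <= score <= max_score (inclusive, by assumption)."""
    assert min_score <= score <= max_score, (
        f"score {score:.3f} outside [{min_score:.3f}, {max_score:.3f}]"
    )


# Illustrative values: precision 0.9 and recall 0.6 give an F1 of about 0.72.
score = f1_score(0.9, 0.6)
assert_in_range(score, min_score=0.6, max_score=0.8)
```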
- pytest_texts_score.texts_expect_precision_equal(expected: str, given: str, target: float = 1.0, max_delta: float = 0.2, skip_warnings: bool = False, retry_on_error: bool = True) → None
Assert that the precision score is close to a target value.
This is a convenience wrapper around texts_expect_precision_range(). It performs a single precision score evaluation and asserts that the result is within target ± max_delta.
Parameters:
expected (str) – The reference text.
given (str) – The text to evaluate.
target (float) – The expected precision score. Defaults to 1.0.
max_delta (float) – The allowed deviation from the target. Defaults to 0.2.
skip_warnings (bool) – If True, suppresses input validation warnings.
retry_on_error (bool) – If True, retries LLM calls on failure.
- pytest_texts_score.texts_expect_precision_range(expected: str, given: str, min_score: float, max_score: float, skip_warnings: bool = False, retry_on_error: bool = True) → None
Assert that the precision score falls within a specified range.
This function performs a single evaluation of the precision score between the expected and given texts. It then asserts that the resulting score is between min_score and max_score.
Parameters:
expected (str) – The reference text.
given (str) – The text to evaluate.
min_score (float) – The minimum acceptable precision score.
max_score (float) – The maximum acceptable precision score.
skip_warnings (bool) – If True, suppresses input validation warnings.
retry_on_error (bool) – If True, retries LLM calls on failure.
- pytest_texts_score.texts_expect_recall_equal(expected: str, given: str, target: float = 1.0, max_delta: float = 0.2, skip_warnings: bool = False, retry_on_error: bool = True) → None
Assert that the recall score is close to a target value.
This is a convenience wrapper around texts_expect_recall_range(). It performs a single recall score evaluation and asserts that the result is within target ± max_delta.
Parameters:
expected (str) – The reference text.
given (str) – The text to evaluate.
target (float) – The expected recall score. Defaults to 1.0.
max_delta (float) – The allowed deviation from the target. Defaults to 0.2.
skip_warnings (bool) – If True, suppresses input validation warnings.
retry_on_error (bool) – If True, retries LLM calls on failure.
- pytest_texts_score.texts_expect_recall_range(expected: str, given: str, min_score: float, max_score: float, skip_warnings: bool = False, retry_on_error: bool = True) → None
Assert that the recall score falls within a specified range.
This function performs a single evaluation of the recall score between the expected and given texts. It then asserts that the resulting score is between min_score and max_score.
Parameters:
expected (str) – The reference text.
given (str) – The text to evaluate.
min_score (float) – The minimum acceptable recall score.
max_score (float) – The maximum acceptable recall score.
skip_warnings (bool) – If True, suppresses input validation warnings.
retry_on_error (bool) – If True, retries LLM calls on failure.
Submodules
pytest_texts_score.api module
- pytest_texts_score.api.MINIMAL_EXPECTED_MAX_DELTA = 0.05
A recommended minimum value for the max_delta or range width. Used to warn users if their test’s acceptance criteria are very strict, which might lead to flaky tests due to LLM non-determinism.
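A strictness warning of this kind could look roughly like the sketch below. Only the constant's value comes from the documentation above. The helper warn_if_too_strict, the strict less-than comparison, the warning category, and the link to skip_warnings are all assumptions made for illustration.

```python
import warnings

MINIMAL_EXPECTED_MAX_DELTA = 0.05  # value taken from the documentation above


def warn_if_too_strict(max_delta: float, skip_warnings: bool = False) -> bool:
    """Warn when the tolerance is tighter than the recommended minimum.

    Hypothetical helper: returns True if a warning was issued.
    """
    if skip_warnings or max_delta >= MINIMAL_EXPECTED_MAX_DELTA:
        return False
    warnings.warn(
        f"max_delta={max_delta} is below {MINIMAL_EXPECTED_MAX_DELTA}; "
        "LLM non-determinism may make this test flaky.",
        UserWarning,
        stacklevel=2,
    )
    return True
```

For the range-based functions, the analogous quantity would presumably be the width max_score - min_score rather than max_delta.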
- pytest_texts_score.api.texts_agg_f1_max(expected: str, given: str, upper_bound: float, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None
Assert that the maximum aggregated F1 score does not exceed an upper bound.
Performs multiple evaluation runs, calculates the maximum F1 score across all runs, and asserts that this maximum score is less than or equal to upper_bound.
Parameters:
expected (str) – The reference text.
given (str) – The text to evaluate.
upper_bound (float) – The maximum acceptable score for the aggregated maximum.
full_runs (int) – Number of times to generate new questions. Defaults to 5.
each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.
retry_on_error (bool) – If True, retries LLM calls on failure.
- pytest_texts_score.api.texts_agg_f1_mean(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None
Assert that the mean aggregated F1 score is close to a target value.
Performs multiple evaluation runs, calculates the mean (average) F1 score, and asserts that it falls within the range defined by target ± max_delta.
Parameters:
expected (str) – The reference text.
given (str) – The text to evaluate.
target (float) – The expected mean score.
max_delta (float) – The allowed deviation from the target. Defaults to 0.1.
full_runs (int) – Number of times to generate new questions. Defaults to 5.
each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.
retry_on_error (bool) – If True, retries LLM calls on failure.
- pytest_texts_score.api.texts_agg_f1_median(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None
Assert that the median aggregated F1 score is close to a target value.
Performs multiple evaluation runs, calculates the median F1 score, and asserts that it falls within the range defined by target ± max_delta.
Parameters:
expected (str) – The reference text.
given (str) – The text to evaluate.
target (float) – The expected median score.
max_delta (float) – The allowed deviation from the target. Defaults to 0.1.
full_runs (int) – Number of times to generate new questions. Defaults to 5.
each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.
retry_on_error (bool) – If True, retries LLM calls on failure.
- pytest_texts_score.api.texts_agg_f1_min(expected: str, given: str, lower_bound: float, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None
Assert that the minimum aggregated F1 score does not fall below a lower bound.
Performs multiple evaluation runs, calculates the minimum F1 score across all runs, and asserts that this minimum score is greater than or equal to lower_bound.
Parameters:
expected (str) – The reference text.
given (str) – The text to evaluate.
lower_bound (float) – The minimum acceptable score for the aggregated minimum.
full_runs (int) – Number of times to generate new questions. Defaults to 5.
each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.
retry_on_error (bool) – If True, retries LLM calls on failure.
- pytest_texts_score.api.texts_agg_precision_max(expected: str, given: str, upper_bound: float, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None
Assert that the maximum aggregated precision does not exceed an upper bound.
Performs multiple evaluation runs, calculates the maximum precision score across all runs, and asserts that this maximum score is less than or equal to upper_bound.
Parameters:
expected (str) – The reference text.
given (str) – The text to evaluate.
upper_bound (float) – The maximum acceptable score for the aggregated maximum.
full_runs (int) – Number of times to generate new questions. Defaults to 5.
each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.
retry_on_error (bool) – If True, retries LLM calls on failure.
- pytest_texts_score.api.texts_agg_precision_mean(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None
Assert that the mean aggregated precision is close to a target value.
Performs multiple evaluation runs, calculates the mean (average) precision score, and asserts that it falls within the range defined by target ± max_delta.
Parameters:
expected (str) – The reference text.
given (str) – The text to evaluate.
target (float) – The expected mean score.
max_delta (float) – The allowed deviation from the target. Defaults to 0.1.
full_runs (int) – Number of times to generate new questions. Defaults to 5.
each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.
retry_on_error (bool) – If True, retries LLM calls on failure.
- pytest_texts_score.api.texts_agg_precision_median(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None
Assert that the median aggregated precision is close to a target value.
Performs multiple evaluation runs, calculates the median precision score, and asserts that it falls within the range defined by target ± max_delta.
Parameters:
expected (str) – The reference text.
given (str) – The text to evaluate.
target (float) – The expected median score.
max_delta (float) – The allowed deviation from the target. Defaults to 0.1.
full_runs (int) – Number of times to generate new questions. Defaults to 5.
each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.
retry_on_error (bool) – If True, retries LLM calls on failure.
- pytest_texts_score.api.texts_agg_precision_min(expected: str, given: str, lower_bound: float, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None
Assert that the minimum aggregated precision does not fall below a lower bound.
Performs multiple evaluation runs, calculates the minimum precision score across all runs, and asserts that this minimum score is greater than or equal to lower_bound.
Parameters:
expected (str) – The reference text.
given (str) – The text to evaluate.
lower_bound (float) – The minimum acceptable score for the aggregated minimum.
full_runs (int) – Number of times to generate new questions. Defaults to 5.
each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.
retry_on_error (bool) – If True, retries LLM calls on failure.
- pytest_texts_score.api.texts_agg_recall_max(expected: str, given: str, upper_bound: float, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None
Assert that the maximum aggregated recall does not exceed an upper bound.
Performs multiple evaluation runs, calculates the maximum recall score across all runs, and asserts that this maximum score is less than or equal to upper_bound.
Parameters:
expected (str) – The reference text.
given (str) – The text to evaluate.
upper_bound (float) – The maximum acceptable score for the aggregated maximum.
full_runs (int) – Number of times to generate new questions. Defaults to 5.
each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.
retry_on_error (bool) – If True, retries LLM calls on failure.
- pytest_texts_score.api.texts_agg_recall_mean(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None
Assert that the mean aggregated recall is close to a target value.
Performs multiple evaluation runs, calculates the mean (average) recall score, and asserts that it falls within the range defined by target ± max_delta.
Parameters:
expected (str) – The reference text.
given (str) – The text to evaluate.
target (float) – The expected mean score.
max_delta (float) – The allowed deviation from the target. Defaults to 0.1.
full_runs (int) – Number of times to generate new questions. Defaults to 5.
each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.
retry_on_error (bool) – If True, retries LLM calls on failure.
- pytest_texts_score.api.texts_agg_recall_median(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None
Assert that the median aggregated recall is close to a target value.
Performs multiple evaluation runs, calculates the median recall score, and asserts that it falls within the range defined by target ± max_delta.
Parameters:
expected (str) – The reference text.
given (str) – The text to evaluate.
target (float) – The expected median score.
max_delta (float) – The allowed deviation from the target. Defaults to 0.1.
full_runs (int) – Number of times to generate new questions. Defaults to 5.
each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.
retry_on_error (bool) – If True, retries LLM calls on failure.
- pytest_texts_score.api.texts_agg_recall_min(expected: str, given: str, lower_bound: float, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) → None
Assert that the minimum aggregated recall does not fall below a lower bound.
Performs multiple evaluation runs, calculates the minimum recall score across all runs, and asserts that this minimum score is greater than or equal to lower_bound.
Parameters:
expected (str) – The reference text.
given (str) – The text to evaluate.
lower_bound (float) – The minimum acceptable score for the aggregated minimum.
full_runs (int) – Number of times to generate new questions. Defaults to 5.
each_question_runs (int) – Number of times to evaluate answers per question set. Defaults to 1.
retry_on_error (bool) – If True, retries LLM calls on failure.
- pytest_texts_score.api.texts_expect_f1_equal(expected: str, given: str, target: float = 1.0, max_delta: float = 0.2, skip_warnings: bool = False, retry_on_error: bool = True) → None
Assert that the F1 score is close to a target value.
This is a convenience wrapper around texts_expect_f1_range(). It performs a single F1 score evaluation and asserts that the result is within target ± max_delta.
Parameters:
expected (str) – The reference text.
given (str) – The text to evaluate.
target (float) – The expected F1 score. Defaults to 1.0.
max_delta (float) – The allowed deviation from the target. Defaults to 0.2.
skip_warnings (bool) – If True, suppresses input validation warnings.
retry_on_error (bool) – If True, retries LLM calls on failure.
- pytest_texts_score.api.texts_expect_f1_range(expected: str, given: str, min_score: float, max_score: float, skip_warnings: bool = False, retry_on_error: bool = True) → None
Assert that the F1 score falls within a specified range.
This function performs a single evaluation of the F1 score between the expected and given texts. It then asserts that the resulting score is between min_score and max_score.
Parameters:
expected (str) – The reference text.
given (str) – The text to evaluate.
min_score (float) – The minimum acceptable F1 score.
max_score (float) – The maximum acceptable F1 score.
skip_warnings (bool) – If True, suppresses input validation warnings.
retry_on_error (bool) – If True, retries LLM calls on failure.
- pytest_texts_score.api.texts_expect_precision_equal(expected: str, given: str, target: float = 1.0, max_delta: float = 0.2, skip_warnings: bool = False, retry_on_error: bool = True) → None
Assert that the precision score is close to a target value.
This is a convenience wrapper around texts_expect_precision_range(). It performs a single precision score evaluation and asserts that the result is within target ± max_delta.
Parameters:
expected (str) – The reference text.
given (str) – The text to evaluate.
target (float) – The expected precision score. Defaults to 1.0.
max_delta (float) – The allowed deviation from the target. Defaults to 0.2.
skip_warnings (bool) – If True, suppresses input validation warnings.
retry_on_error (bool) – If True, retries LLM calls on failure.
- pytest_texts_score.api.texts_expect_precision_range(expected: str, given: str, min_score: float, max_score: float, skip_warnings: bool = False, retry_on_error: bool = True) None[source]¶
Assert that the precision score falls within a specified range.
This function performs a single evaluation of the precision score between the
expectedandgiventexts. It then asserts that the resulting score is betweenmin_scoreandmax_score.- Parameters:
expected (str) – The reference text.
given (str) – The text to evaluate.
min_score (float) – The minimum acceptable precision score.
max_score (float) – The maximum acceptable precision score.
skip_warnings (bool) – If True, suppresses input validation warnings.
retry_on_error (bool) – If True, retries LLM calls on failure.
- pytest_texts_score.api.texts_expect_recall_equal(expected: str, given: str, target: float = 1.0, max_delta: float = 0.2, skip_warnings: bool = False, retry_on_error: bool = True) None[source]¶
Assert that the recall score is close to a target value.
This is a convenience wrapper around texts_expect_recall_range(). It performs a single recall score evaluation and asserts that the result is within target ± max_delta.
- Parameters:
expected (str) – The reference text.
given (str) – The text to evaluate.
target (float) – The expected recall score. Defaults to 1.0.
max_delta (float) – The allowed deviation from the target. Defaults to 0.2.
skip_warnings (bool) – If True, suppresses input validation warnings.
retry_on_error (bool) – If True, retries LLM calls on failure.
- pytest_texts_score.api.texts_expect_recall_range(expected: str, given: str, min_score: float, max_score: float, skip_warnings: bool = False, retry_on_error: bool = True) None[source]¶
Assert that the recall score falls within a specified range.
This function performs a single evaluation of the recall score between the expected and given texts. It then asserts that the resulting score is between min_score and max_score.
- Parameters:
expected (str) – The reference text.
given (str) – The text to evaluate.
min_score (float) – The minimum acceptable recall score.
max_score (float) – The maximum acceptable recall score.
skip_warnings (bool) – If True, suppresses input validation warnings.
retry_on_error (bool) – If True, retries LLM calls on failure.
pytest_texts_score.api_wrappers module¶
This module provides wrapper functions for the public API, offering alternative names for existing functionality. For example, functions using ‘mean’ are aliased with ‘average’.
- pytest_texts_score.api_wrappers.texts_agg_completeness_average(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) None[source]¶
Alias for texts_agg_precision_average.
- pytest_texts_score.api_wrappers.texts_agg_completeness_max(expected: str, given: str, upper_bound: float, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) None[source]¶
Alias for texts_agg_precision_max.
- pytest_texts_score.api_wrappers.texts_agg_completeness_mean(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) None[source]¶
Alias for texts_agg_precision_mean.
- pytest_texts_score.api_wrappers.texts_agg_completeness_median(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) None[source]¶
Alias for texts_agg_precision_median.
- pytest_texts_score.api_wrappers.texts_agg_completeness_min(expected: str, given: str, lower_bound: float, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) None[source]¶
Alias for texts_agg_precision_min.
- pytest_texts_score.api_wrappers.texts_agg_correctness_average(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) None[source]¶
Alias for texts_agg_recall_average.
- pytest_texts_score.api_wrappers.texts_agg_correctness_max(expected: str, given: str, upper_bound: float, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) None[source]¶
Alias for texts_agg_recall_max.
- pytest_texts_score.api_wrappers.texts_agg_correctness_mean(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) None[source]¶
Alias for texts_agg_recall_mean.
- pytest_texts_score.api_wrappers.texts_agg_correctness_median(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) None[source]¶
Alias for texts_agg_recall_median.
- pytest_texts_score.api_wrappers.texts_agg_correctness_min(expected: str, given: str, lower_bound: float, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) None[source]¶
Alias for texts_agg_recall_min.
- pytest_texts_score.api_wrappers.texts_agg_f1_average(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) None[source]¶
Alias for texts_agg_f1_mean.
- pytest_texts_score.api_wrappers.texts_agg_precision_average(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) None[source]¶
Alias for texts_agg_precision_mean.
- pytest_texts_score.api_wrappers.texts_agg_recall_average(expected: str, given: str, target: float, max_delta: float = 0.1, full_runs: int = 5, each_question_runs: int = 1, retry_on_error: bool = True) None[source]¶
Alias for texts_agg_recall_mean.
- pytest_texts_score.api_wrappers.texts_expect_completeness_equal(expected: str, given: str, target: float = 1.0, max_delta: float = 0.2, skip_warnings: bool = False, retry_on_error: bool = True) None[source]¶
Alias for texts_expect_precision_equal.
- pytest_texts_score.api_wrappers.texts_expect_completeness_range(expected: str, given: str, min_score: float, max_score: float, skip_warnings: bool = False, retry_on_error: bool = True) None[source]¶
Alias for texts_expect_precision_range.
pytest_texts_score.client module¶
- pytest_texts_score.client.get_client() AzureOpenAI[source]¶
Return the initialized AzureOpenAI client.
Retrieves the globally stored AzureOpenAI client instance. It is designed to be called after init_client() has been executed, typically within a pytest fixture.
- Returns:
The initialized AzureOpenAI client instance.
- Return type:
AzureOpenAI
- Raises:
RuntimeError – If the client has not been initialized by calling init_client() first.
- pytest_texts_score.client.init_client(config: Config) AzureOpenAI[source]¶
Initialize and store the global AzureOpenAI client.
This function uses the provided pytest configuration object to instantiate the AzureOpenAI client. The created client instance is stored in a global variable for later retrieval via get_client().
- Parameters:
config (pytest.Config) – The pytest config object containing LLM settings.
- Returns:
The newly created AzureOpenAI client instance.
- Return type:
AzureOpenAI
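The init-then-get pattern described for this module can be sketched as below. This is an illustration only: FakeClient stands in for the real AzureOpenAI class, and init_client here takes a plain endpoint string instead of a pytest.Config object so that the example runs without any dependencies. Both simplifications are assumptions of the sketch, not the package's code.

```python
from typing import Optional


class FakeClient:
    """Stand-in for the real AzureOpenAI client (assumption for this sketch)."""

    def __init__(self, endpoint: str) -> None:
        self.endpoint = endpoint


# Module-level storage shared by init_client() and get_client().
_client: Optional[FakeClient] = None


def init_client(endpoint: str) -> FakeClient:
    """Create the client, remember it globally, and return it."""
    global _client
    _client = FakeClient(endpoint)
    return _client


def get_client() -> FakeClient:
    """Return the stored client, or fail loudly if init_client() was skipped."""
    if _client is None:
        raise RuntimeError("Client not initialized; call init_client() first.")
    return _client
```

Raising RuntimeError instead of returning None makes a missing initialization visible at the first call site rather than as an attribute error later on.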
pytest_texts_score.communication module¶
- pytest_texts_score.communication.evaluate_questions(answer_text: str, questions_text: str) list[dict[str, Any]][source]¶
Evaluate how well a text answers a list of questions using the LLM.
This function sends the answer_text and a JSON string of questions_text to the configured Azure OpenAI model. The model is prompted to answer each question based on the text and provide a numeric score. The function parses the JSON response and returns the list of answers. It also handles and warns about responses that might include markdown ```json tags.
- Parameters:
answer_text (str) – The text to use for answering the questions.
questions_text (str) – A JSON string representing the list of questions.
- Returns:
A list of dictionaries, where each dictionary contains a ‘question’ and its corresponding ‘answer’ score.
- Return type:
list[dict[str, Any]]
- Raises:
ValueError – If the LLM response is not valid JSON or cannot be parsed.
openai.APIError – If the API call to the LLM fails.
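The response handling described above (parse JSON, tolerate a markdown code fence, raise ValueError on bad input) might look roughly like the following. This is a sketch under assumptions: parse_answers is a hypothetical name, the exact fence-stripping rules used by the package are not documented here, and the check that the payload is a list is an addition of this example.

```python
import json
import warnings
from typing import Any

FENCE = "`" * 3  # three backticks, built at runtime to keep this listing readable


def parse_answers(raw: str) -> list[dict[str, Any]]:
    """Parse an LLM reply into a list of answer dicts, tolerating a markdown fence."""
    text = raw.strip()
    if text.startswith(FENCE):
        warnings.warn("Response was wrapped in a markdown code fence; stripping it.")
        # Drop the opening fence line (which may carry a 'json' tag).
        text = text.split("\n", 1)[1] if "\n" in text else ""
        # Drop the closing fence if present.
        if text.rstrip().endswith(FENCE):
            text = text.rstrip()[: -len(FENCE)]
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"LLM response is not valid JSON: {exc}") from exc
    if not isinstance(data, list):
        raise ValueError("Expected a JSON list of answers.")
    return data
```

Converting json.JSONDecodeError into ValueError matches the documented Raises entry and keeps callers independent of the JSON library in use.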
- pytest_texts_score.communication.make_questions(base_text: str) str[source]¶
Generate questions from a given text using the LLM.
This function sends the base_text to the configured Azure OpenAI model with a system prompt designed to elicit factual yes/no questions. It retrieves the global configuration and client instance to make the API call.
- Parameters:
base_text (str) – The text from which to generate questions.
- Returns:
A JSON string containing the generated questions. Returns an empty string if the model response content is empty.
- Return type:
str
- Raises:
openai.APIError – If the API call to the LLM fails.
pytest_texts_score.evaluate_score module¶
- class pytest_texts_score.evaluate_score.AggType(*values)[source]¶
Bases: str, Enum
Aggregation types for recall scores.
- AVERAGE = 'average'¶
- MAXIMUM = 'maximum'¶
- MEAN = 'mean'¶
- MEDIAN = 'median'¶
- MINIMUM = 'minimum'¶
- pytest_texts_score.evaluate_score.MAXIMAL_RETRY_ON_ERROR = 5¶
The maximum number of times to retry an LLM call upon failure before raising an exception.
- class pytest_texts_score.evaluate_score.ScoreType(*values)[source]¶
Bases: str, Enum
- F1 = 'f1'¶
- PRECISION = 'precision'¶
- RECALL = 'recall'¶
- pytest_texts_score.evaluate_score.f1_score(precision: float, recall: float) float[source]¶
Calculate the F1 score from precision and recall.
Computes the harmonic mean of precision and recall. Returns 0 if both precision and recall are 0 to avoid division by zero.
- Parameters:
precision (float) – The precision score (between 0.0 and 1.0).
recall (float) – The recall score (between 0.0 and 1.0).
- Returns:
The F1 score.
- Return type:
float
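For reference, the harmonic mean with the zero guard described above can be written in a few lines. This version is written from the standard F1 definition rather than copied from the package source, so details such as the exact form of the zero check are assumptions.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both inputs are 0."""
    if precision + recall == 0:
        return 0.0  # avoids dividing by zero
    return 2 * precision * recall / (precision + recall)


# A perfect precision cannot compensate for a weak recall:
print(f1_score(1.0, 0.5))
```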
- pytest_texts_score.evaluate_score.score_one_side(base_text: str, answer_text: str, retry_on_error: bool = True) float[source]¶
Calculate a one-sided score by generating questions from one text and answering with another.
This is a fundamental building block for both precision and recall calculations. It generates a set of questions based on base_text and then evaluates how well answer_text can answer them. The final score is the average of the answer scores.
- Parameters:
base_text (str) – The text to generate questions from.
answer_text (str) – The text to answer the questions with.
retry_on_error (bool) – Whether to retry LLM calls on failure. Defaults to True.
- Returns:
The average score from the evaluation.
- Return type:
float
- Raises:
Exception – If the operation fails after the maximum number of retries.
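The generate-then-answer flow with retries can be sketched as follows. The LLM steps are passed in as plain callables so that the example runs offline. That injection, the name score_one_side_sketch, and the reading of MAXIMAL_RETRY_ON_ERROR as the total number of attempts are all assumptions of this sketch. The real function calls make_questions and evaluate_questions from the communication module directly.

```python
from typing import Any, Callable

MAXIMAL_RETRY_ON_ERROR = 5  # value documented for this module


def score_one_side_sketch(
    base_text: str,
    answer_text: str,
    make_questions: Callable[[str], str],
    evaluate_questions: Callable[[str, str], list[dict[str, Any]]],
    retry_on_error: bool = True,
) -> float:
    """Generate questions from base_text, answer them with answer_text, average."""
    attempts = MAXIMAL_RETRY_ON_ERROR if retry_on_error else 1
    last_error = None
    for _ in range(attempts):
        try:
            questions = make_questions(base_text)
            answers = evaluate_questions(answer_text, questions)
            # An empty answer list raises ZeroDivisionError and counts as a failure.
            return sum(item["answer"] for item in answers) / len(answers)
        except Exception as exc:
            last_error = exc
    raise Exception(f"Scoring failed after {attempts} attempt(s)") from last_error
```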
- pytest_texts_score.evaluate_score.scores_agg(scores: list[float], agg_type: AggType | Literal['minimum', 'maximum', 'median', 'average', 'mean']) float[source]¶
Aggregate a list of scores using a specified method.
This function takes a list of numeric scores and applies an aggregation function (min, max, median, or mean/average) to produce a single summary score.
- Parameters:
scores (list[float]) – A list of scores to aggregate.
agg_type (AggType | Literal["minimum", "maximum", "median", "average", "mean"]) – The aggregation method to use.
- Returns:
The aggregated score.
- Return type:
float
- Raises:
ValueError – If an unknown aggregation type is provided.
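A dispatch table is one straightforward way to implement this kind of aggregation. The sketch below assumes that "average" and "mean" are treated identically, as the alias functions elsewhere in this reference suggest. The name scores_agg_sketch is hypothetical and the package's actual implementation may differ.

```python
import statistics
from typing import Callable

# Map each documented aggregation name to a standard-library function.
_AGGREGATORS: dict[str, Callable[[list[float]], float]] = {
    "minimum": min,
    "maximum": max,
    "median": statistics.median,
    "mean": statistics.fmean,
    "average": statistics.fmean,  # assumed to be a synonym for "mean"
}


def scores_agg_sketch(scores: list[float], agg_type: str) -> float:
    """Reduce a list of scores to one value using the named aggregation."""
    aggregate = _AGGREGATORS.get(agg_type)
    if aggregate is None:
        raise ValueError(f"Unknown aggregation type: {agg_type!r}")
    return aggregate(scores)


print(scores_agg_sketch([0.2, 0.4, 0.9], "median"))
```

Because AggType derives from str, its members compare and hash like their string values, so a lookup table keyed by plain strings would also accept enum members.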
- pytest_texts_score.evaluate_score.texts_agg_f1(expected: str, given: str, generate_questions: int, generate_answers_per_questions: int, agg_type: AggType | Literal['minimum', 'maximum', 'median', 'average', 'mean'], retry_on_error: bool = True) float[source]¶
Calculate an aggregated F1 score over multiple runs.
This function first generates multiple F1 scores by calling texts_multiple_f1 and then aggregates these scores using the specified agg_type.
- Parameters:
expected (str) – The reference text.
given (str) – The text to evaluate.
generate_questions (int) – The number of times to generate a new set of questions.
generate_answers_per_questions (int) – The number of times to evaluate answers for each set of questions.
agg_type (AggType | Literal["minimum", "maximum", "median", "average", "mean"]) – The aggregation method to use on the collected scores.
retry_on_error (bool) – Whether to retry LLM calls on failure. Defaults to True.
- Returns:
The final aggregated F1 score.
- Return type:
float
- pytest_texts_score.evaluate_score.texts_agg_precision(expected: str, given: str, generate_questions: int, generate_answers_per_questions: int, agg_type: AggType | Literal['minimum', 'maximum', 'median', 'average', 'mean'], retry_on_error: bool = True) float[source]¶
Calculate an aggregated precision score over multiple runs.
This function first generates multiple precision scores by calling texts_multiple_precision and then aggregates these scores using the specified agg_type.
- Parameters:
expected (str) – The reference text.
given (str) – The text to evaluate.
generate_questions (int) – The number of times to generate a new set of questions.
generate_answers_per_questions (int) – The number of times to evaluate answers for each set of questions.
agg_type (AggType | Literal["minimum", "maximum", "median", "average", "mean"]) – The aggregation method to use on the collected scores.
retry_on_error (bool) – Whether to retry LLM calls on failure. Defaults to True.
- Returns:
The final aggregated precision score.
- Return type:
float
- pytest_texts_score.evaluate_score.texts_agg_recall(expected: str, given: str, generate_questions: int, generate_answers_per_questions: int, agg_type: AggType | Literal['minimum', 'maximum', 'median', 'average', 'mean'], retry_on_error: bool = True) float[source]¶
Calculate an aggregated recall score over multiple runs.
This function first generates multiple recall scores by calling texts_multiple_recall and then aggregates these scores using the specified agg_type.
- Parameters:
expected (str) – The reference text.
given (str) – The text to evaluate.
generate_questions (int) – The number of times to generate a new set of questions.
generate_answers_per_questions (int) – The number of times to evaluate answers for each set of questions.
agg_type (AggType | Literal["minimum", "maximum", "median", "average", "mean"]) – The aggregation method to use on the collected scores.
retry_on_error (bool) – Whether to retry LLM calls on failure. Defaults to True.
- Returns:
The final aggregated recall score.
- Return type:
float
- pytest_texts_score.evaluate_score.texts_evaluate_f1(expected: str, given: str, retry_on_error: bool = True) float[source]¶
Calculate the F1 score between two texts.
This function computes the F1 score by first calculating the precision and recall between the expected and given texts. It serves as a single-run evaluation of the harmonic mean of precision and recall.
- Parameters:
expected (str) – The reference text.
given (str) – The text to be evaluated against the reference.
retry_on_error (bool) – Whether to retry the LLM call on failure. Defaults to True.
- Returns:
The calculated F1 score.
- Return type:
float
- pytest_texts_score.evaluate_score.texts_evaluate_precision(expected: str, given: str, retry_on_error: bool = True) float[source]¶
Evaluate the precision score of the given text against the expected text.
Precision is calculated by generating questions from the given text and checking how well they are answered by the expected text. This measures how much of the information in the given text is also present in the expected text.
- Parameters:
expected (str) – The reference text used for answering questions.
given (str) – The text from which questions are generated.
retry_on_error (bool) – Whether to retry the LLM call on failure. Defaults to True.
- Returns:
The calculated precision score.
- Return type:
float
- pytest_texts_score.evaluate_score.texts_evaluate_recall(expected: str, given: str, retry_on_error: bool = True) float[source]¶
Evaluate the recall score of the given text against the expected text.
Recall is calculated by generating questions from the expected text and checking how well they are answered by the given text. This measures how much of the information in the expected text is covered by the given text.
- Parameters:
expected (str) – The reference text from which questions are generated.
given (str) – The text used for answering questions.
retry_on_error (bool) – Whether to retry the LLM call on failure. Defaults to True.
- Returns:
The calculated recall score.
- Return type:
float
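Precision and recall as documented here differ only in which text supplies the questions and which text answers them. The sketch below makes that argument swap explicit. The LLM step is replaced by a toy word-overlap score so that the example is deterministic; that substitution and every name ending in _sketch or starting with toy_ are assumptions for illustration and bear no relation to how the package scores answers.

```python
def toy_score_one_side(base_text: str, answer_text: str) -> float:
    """Toy stand-in for the LLM step: share of base_text words found in answer_text."""
    base_words = set(base_text.lower().split())
    answer_words = set(answer_text.lower().split())
    if not base_words:
        return 0.0
    return len(base_words & answer_words) / len(base_words)


def precision_sketch(expected: str, given: str) -> float:
    # Questions come from `given`; `expected` has to answer them.
    return toy_score_one_side(base_text=given, answer_text=expected)


def recall_sketch(expected: str, given: str) -> float:
    # Questions come from `expected`; `given` has to answer them.
    return toy_score_one_side(base_text=expected, answer_text=given)


expected = "the cat sat on the mat"
given = "the cat sat"
print(precision_sketch(expected, given), recall_sketch(expected, given))
```

In this toy run the shorter given text contains nothing that the reference lacks, so precision is high, while recall drops because part of the reference is not covered.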
- pytest_texts_score.evaluate_score.texts_multiple_f1(expected: str, given: str, generate_questions: int, generate_answers_per_questions: int, score_only: bool = True, retry_on_error: bool = True) list[float] | list[tuple[int, int, float, float, float]][source]¶
Perform multiple evaluation runs to get a list of F1 scores.
This function runs the F1 score evaluation multiple times to account for variability in LLM responses. It generates new sets of questions for precision and recall in each generate_questions loop, and for each set, it evaluates answers generate_answers_per_questions times.
- Parameters:
expected (str) – The reference text.
given (str) – The text to evaluate.
generate_questions (int) – The number of times to generate a new set of questions.
generate_answers_per_questions (int) – The number of times to evaluate answers for each set of questions.
score_only (bool) – If True, returns only a list of F1 scores. If False, returns a list of tuples with detailed run info. Defaults to True.
retry_on_error (bool) – Whether to retry LLM calls on failure. Defaults to True.
- Returns:
A list of F1 scores, or a list of tuples (question_run, answer_run, precision, recall, f1_score).
- Return type:
list[float] | list[tuple[int, int, float, float, float]]
- Raises:
Exception – If the operation fails after the maximum number of retries.
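The two nested loops described for this function can be sketched as below. The question generation and scoring steps are injected as callables so that the example runs without an LLM. Those parameters, the name multiple_f1_sketch, and the zero-based run indices are assumptions of this sketch and are not part of the documented signature.

```python
from typing import Callable, Union

Detail = tuple[int, int, float, float, float]


def multiple_f1_sketch(
    generate_questions: int,
    generate_answers_per_questions: int,
    new_questions: Callable[[], object],
    precision_for: Callable[[object], float],
    recall_for: Callable[[object], float],
    score_only: bool = True,
) -> Union[list[float], list[Detail]]:
    """Collect one F1 score per (question run, answer run) pair."""
    results: list = []
    for question_run in range(generate_questions):
        questions = new_questions()  # fresh question sets for this outer run
        for answer_run in range(generate_answers_per_questions):
            precision = precision_for(questions)
            recall = recall_for(questions)
            total = precision + recall
            f1 = 0.0 if total == 0 else 2 * precision * recall / total
            results.append(
                f1 if score_only else (question_run, answer_run, precision, recall, f1)
            )
    return results
```

The result holds generate_questions × generate_answers_per_questions entries, which is also the number of answer evaluations per metric and therefore a rough guide to the cost of a run.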
- pytest_texts_score.evaluate_score.texts_multiple_precision(expected: str, given: str, generate_questions: int, generate_answers_per_questions: int, score_only: bool = True, retry_on_error: bool = True) list[float] | list[tuple[int, int, float]][source]¶
Perform multiple evaluation runs to get a list of precision scores.
This function runs the precision score evaluation multiple times. It generates new sets of questions from the given text in each generate_questions loop, and for each set, it evaluates answers generate_answers_per_questions times using the expected text.
- Parameters:
expected (str) – The reference text for answering.
given (str) – The text to generate questions from.
generate_questions (int) – The number of times to generate a new set of questions.
generate_answers_per_questions (int) – The number of times to evaluate answers for each set of questions.
score_only (bool) – If True, returns only a list of precision scores. If False, returns a list of tuples with detailed run info. Defaults to True.
retry_on_error (bool) – Whether to retry LLM calls on failure. Defaults to True.
- Returns:
A list of precision scores, or a list of tuples (question_run, answer_run, precision).
- Return type:
list[float] | list[tuple[int, int, float]]
- Raises:
Exception – If the operation fails after the maximum number of retries.
- pytest_texts_score.evaluate_score.texts_multiple_recall(expected: str, given: str, generate_questions: int, generate_answers_per_questions: int, score_only: bool = True, retry_on_error: bool = True) list[float] | list[tuple[int, int, float]][source]¶
Perform multiple evaluation runs to get a list of recall scores.
This function runs the recall score evaluation multiple times. It generates new sets of questions from the expected text in each generate_questions loop, and for each set, it evaluates answers generate_answers_per_questions times using the given text.
- Parameters:
expected (str) – The reference text to generate questions from.
given (str) – The text for answering.
generate_questions (int) – The number of times to generate a new set of questions.
generate_answers_per_questions (int) – The number of times to evaluate answers for each set of questions.
score_only (bool) – If True, returns only a list of recall scores. If False, returns a list of tuples with detailed run info. Defaults to True.
retry_on_error (bool) – Whether to retry LLM calls on failure. Defaults to True.
- Returns:
A list of recall scores, or a list of tuples (question_run, answer_run, recall).
- Return type:
list[float] | list[tuple[int, int, float]]
- Raises:
Exception – If the operation fails after the maximum number of retries.
pytest_texts_score.plugin module¶
- pytest_texts_score.plugin.get_config() Config[source]¶
Return the initialized pytest configuration object.
Retrieves the globally stored pytest.Config object, which contains the resolved LLM configuration. This function should be called after pytest_configure has run.
- Returns:
The pytest config object.
- Return type:
pytest.Config
- Raises:
RuntimeError – If the configuration has not been initialized.
- pytest_texts_score.plugin.mask_api_key(key: str | None) str | None[source]¶
Mask an API key for safe display.
Replaces all but the first character of the API key with asterisks (*) to prevent leaking sensitive information in logs or reports.
- Parameters:
key (Optional[str]) – The API key to mask.
- Returns:
The masked API key, or None if the input was None.
- Return type:
Optional[str]
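The masking rule stated above fits in one expression. The sketch below is written from that description. The name mask_api_key_sketch is hypothetical, and returning an empty string unchanged is an assumption, since this reference does not say how empty keys are handled.

```python
from typing import Optional


def mask_api_key_sketch(key: Optional[str]) -> Optional[str]:
    """Keep the first character and replace every other character with '*'."""
    if key is None:
        return None
    # Slicing keeps this safe for an empty string (assumed to stay empty).
    return key[:1] + "*" * max(len(key) - 1, 0)


print(mask_api_key_sketch("sk-12345"))
```

The masked value keeps the original length, so a report header still hints at whether a key of plausible size was supplied.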
- pytest_texts_score.plugin.pytest_addoption(parser: Parser) None[source]¶
Add command-line and .ini options for LLM configuration to pytest.
This hook implementation defines various options to configure the Azure OpenAI client, such as API key, endpoint, model, and other parameters. Options can be provided via the command line or a pytest.ini file.
- Parameters:
parser (pytest.Parser) – The pytest option parser.
- Returns:
None.
- pytest_texts_score.plugin.pytest_configure(config: Config) None[source]¶
Resolve LLM config (CLI > ini > default) and initialize the client.
This hook is called after command line and configuration files are parsed. It resolves the final configuration values by prioritizing command-line options over .ini file settings, and then over default values. It validates that all required settings are present and then initializes the global LLM client.
- Parameters:
config (pytest.Config) – The pytest config object.
- Returns:
None.
- Raises:
pytest.UsageError – If any required configuration values are missing.
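The CLI > ini > default precedence and the required-value check can be sketched independently of pytest. In this illustration resolve_option and check_required are hypothetical helpers, treating an empty string as unset is an assumption, and ValueError stands in for pytest.UsageError so that the example has no third-party imports.

```python
from typing import Optional


def resolve_option(
    cli_value: Optional[str],
    ini_value: Optional[str],
    default: Optional[str] = None,
) -> Optional[str]:
    """Return the first usable value in CLI > ini > default order."""
    for candidate in (cli_value, ini_value, default):
        if candidate not in (None, ""):
            return candidate
    return None


def check_required(settings: dict[str, Optional[str]]) -> None:
    """Raise if any required setting is still unresolved."""
    missing = sorted(name for name, value in settings.items() if value is None)
    if missing:
        # The plugin is documented to raise pytest.UsageError here;
        # ValueError keeps this sketch free of a pytest import.
        raise ValueError(f"Missing required settings: {', '.join(missing)}")


# The command-line value wins over the ini value and the default.
print(resolve_option("from-cli", "from-ini", "fallback"))
```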
- pytest_texts_score.plugin.pytest_report_header(config: Config) str[source]¶
Add LLM configuration details to the pytest report header.
This hook provides a custom string to be displayed in the header of the test report, showing the resolved LLM configuration parameters for the current test run. The API key is masked for security.
- Parameters:
config (pytest.Config) – The pytest config object.
- Returns:
A string to be included in the report header.
- Return type:
str
- pytest_texts_score.plugin.texts_score() dict[str, Callable][source]¶
Provide access to text comparison helper functions as a fixture.
This fixture returns a dictionary of callable functions for text scoring and evaluation. These functions include various aggregation and expectation helpers for metrics like F1 score, precision, recall, completeness, and correctness.
- Returns:
A dictionary mapping function names to callable helper functions.
- Return type:
dict[str, Callable]
Note
This fixture returns a dictionary of functions rather than exposing them globally.
pytest_texts_score.prompts module¶
Prompt templates used by pytest-texts-score. These prompts are carefully engineered to guide the LLM’s behavior for question generation and evaluation. Modifying them may have significant impacts on the scoring results.
- pytest_texts_score.prompts.get_system_answers_prompt() str[source]¶
Get the system prompt for answering questions.
This function returns the predefined system prompt that instructs the LLM on how to answer a list of questions based on a given text, using a numeric scoring system.
- Returns:
The question answering prompt string.
- Return type:
str
- pytest_texts_score.prompts.get_system_questions_prompt() str[source]¶
Get the system prompt for generating questions.
This function returns the predefined system prompt that instructs the LLM on how to generate factual yes/no questions from a given text.
- Returns:
The question generation prompt string.
- Return type:
str
- pytest_texts_score.prompts.get_user_answers_prompt(answer_text: str, questions_text: str) str[source]¶
Create a user prompt for answering questions.
This function formats the text and the questions into a single prompt that will be paired with the system answer prompt.
- Parameters:
answer_text (str) – The text to use for answering the questions.
questions_text (str) – The JSON string of questions to be answered.
- Returns:
The formatted user prompt string.
- Return type:
str
- pytest_texts_score.prompts.get_user_questions_prompt(text: str) str[source]¶
Create a user prompt for question generation.
This function formats the user-provided text into a simple prompt that will be paired with the system question prompt.
- Parameters:
text (str) – The text to generate questions from.
- Returns:
The formatted user prompt string.
- Return type:
str