Correctness
Objective: This metric checks the correctness of your LLM response: whether the submission is factually accurate and free from errors when compared to the expected response.
Parameters:
data:
prompt (str): The prompt for the response.
response (str): The actual response to be evaluated.
expected_response (str): The expected response for comparison.
context (str): The ground truth for comparison.
arguments:
model (str, optional): The model to be used for evaluation (default is "gpt-3.5-turbo").
threshold (float, optional): The threshold for the correctness score (default is 0.5).
Interpretation: A higher score indicates that the model response is correct for the given prompt. A failed result indicates that the response is not factually correct when compared to the expected response.
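Before adding tests, an evaluator instance is needed. The snippet below is a minimal setup sketch that assumes the RagaLLMEval client from the raga_llm_hub package and an OpenAI API key; if your installation uses a different client or key name, adjust accordingly.
# Minimal setup sketch (assumption: RagaLLMEval from raga_llm_hub and an OpenAI key)
from raga_llm_hub import RagaLLMEval

evaluator = RagaLLMEval(api_keys={"OPENAI_API_KEY": "your_openai_api_key"})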
# Add tests with custom data
evaluator.add_test(
    test_names=["correctness_test"],
    data={
        "prompt": "Explain the concept of photosynthesis.",
        "response": "Photosynthesis is the process by which plants convert sunlight into energy through chlorophyll.",
        "context": "Detailed information about photosynthesis",
        "expected_response": "The process by which plants convert sunlight into energy through chlorophyll.",
    },
    arguments={"model": "gpt-4", "threshold": 0.6},
).run()
evaluator.print_results()
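To make the threshold semantics concrete, the following is a purely illustrative, hypothetical check (not the library's internal code): a correctness score at or above the configured threshold counts as passed, anything below it fails.
# Hypothetical illustration of the pass/fail rule implied by the threshold argument
def is_passed(score: float, threshold: float = 0.5) -> bool:
    # The test passes when the correctness score meets or exceeds the threshold
    return score >= threshold

print(is_passed(0.72, threshold=0.6))  # True: 0.72 >= 0.6
print(is_passed(0.41, threshold=0.6))  # False: 0.41 < 0.6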