Correctness
Objective: This metric checks the correctness of your LLM response: whether the submission is factually accurate and free from errors when compared to the expected response.
Parameters:
data:
prompt (str): The prompt for the response.
response (str): The actual response to be evaluated.
expected_response (str): The expected response for comparison.
context (str): The ground truth for comparison.
arguments:
model (str, optional): The model to be used for evaluation (default is "gpt-3.5-turbo").
threshold (float, optional): The threshold for the correctness score (default is 0.5).
Interpretation: A higher score indicates that the model response is correct for the given prompt. A failed result indicates that the response is not factually correct when compared to the expected response.
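Before adding tests, an evaluator instance is needed. The snippet below is a minimal setup sketch that assumes the RagaLLMEval client from the raga_llm_hub package and an OpenAI API key; if your installation uses a different client or key name, adjust accordingly.
# Minimal setup sketch (assumption: RagaLLMEval from raga_llm_hub and an OpenAI key)
from raga_llm_hub import RagaLLMEval

evaluator = RagaLLMEval(api_keys={"OPENAI_API_KEY": "your_openai_api_key"})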
# Add tests with custom data
evaluator.add_test(
    test_names=["correctness_test"],
    data={
        "prompt": "Explain the concept of photosynthesis.",
        "response": "Photosynthesis is the process by which plants convert sunlight into energy through chlorophyll.",
        "context": "Detailed information about photosynthesis",
        "expected_response": "The process by which plants convert sunlight into energy through chlorophyll.",
    },
    arguments={"model": "gpt-4", "threshold": 0.6},
).run()
evaluator.print_results()
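To make the threshold semantics concrete, the following is a purely illustrative, hypothetical check (not the library's internal code): a correctness score at or above the configured threshold counts as passed, anything below it fails.
# Hypothetical illustration of the pass/fail rule implied by the threshold argument
def is_passed(score: float, threshold: float = 0.5) -> bool:
    # The test passes when the correctness score meets or exceeds the threshold
    return score >= threshold

print(is_passed(0.72, threshold=0.6))  # True: 0.72 >= 0.6
print(is_passed(0.41, threshold=0.6))  # False: 0.41 < 0.6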