BLEU
Bilingual Evaluation Understudy
Objective:
BLEU measures the overlap of n-grams between machine-generated text and one or more reference texts. It computes precision over n-gram matches and applies a brevity penalty to outputs that are shorter than the reference. BLEU emphasizes precision over recall, focusing on how much of the generated output appears in the reference rather than how much of the reference is covered. Although originally designed for machine translation, it is also used across a range of text generation tasks. However, its precision-focused approach can miss more nuanced aspects of language generation, such as fluency and semantic accuracy.
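To make these mechanics concrete, the sketch below computes sentence-level BLEU from scratch: clipped n-gram precisions for n = 1 to 4, combined through a geometric mean and scaled by the brevity penalty. It is a minimal illustration (single reference, no smoothing), not the exact implementation used by this metric.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU: clipped n-gram precisions (n = 1..max_n),
    uniform weights, and a brevity penalty. Single reference, no smoothing."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        ref_counts = ngrams(reference, n)
        # Clip each candidate n-gram count at its count in the reference.
        clipped = sum(min(cnt, ref_counts[gram]) for gram, cnt in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # any zero precision zeroes the geometric mean
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty: penalize candidates shorter than the reference.
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(log_precisions) / max_n)

hypothesis = "the quick brown fox jumps over the dog".split()
reference = "the quick brown fox jumps over the lazy dog".split()
print(round(bleu(hypothesis, reference), 3))  # ~0.767 for this pair
```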
Required Columns in Dataset:
LLM Summary, Reference Document (GT)
Interpretation:
High BLEU: Represents strong n-gram precision with a good match between generated and reference text, emphasizing close word-for-word similarity.
Low BLEU: Suggests insufficient n-gram overlap, which could indicate poor precision or significant deviation from the reference text.
Execution via UI:
BLEU does not require an LLM for computation.
Execution via SDK:
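The platform-specific SDK call is not shown in this section; as a stand-in, the sketch below scores the two required dataset columns with the open-source sacrebleu package. The CSV file name is a placeholder, and the column handling is an assumption for illustration only.

```python
# Generic corpus-level BLEU over the required dataset columns (not the platform SDK).
import pandas as pd
import sacrebleu

df = pd.read_csv("evaluation_dataset.csv")  # hypothetical dataset file

hypotheses = df["LLM Summary"].astype(str).tolist()
references = df["Reference Document (GT)"].astype(str).tolist()

# sacrebleu expects a list of hypotheses and a list of reference streams.
result = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"Corpus BLEU: {result.score:.2f}")  # reported on a 0-100 scale
```

Note that sacrebleu reports BLEU on a 0-100 scale; divide by 100 to compare against scores reported in the 0-1 range.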