Toxicity RAG Metric

Evaluate LLM responses for toxic or offensive language. Flag unsafe outputs and apply filters to ensure safe, respectful interactions.

Objective: This guardrail uses pre-trained multi-label models to check whether the generated text is toxic. If the model predicts any label of: toxicity, severe_toxicity, obscene, threat, insult, identity_attack, or sexual_explicit with confidence higher than the specified threshold, the guardrail fails.

Required Parameters: Response

Interpretation: Higher score indicates the model response was toxic.

Metric Execution via UI:

Code Execution:

metrics=[
    {"name": "Toxicity", "config": {"model": "gpt-4o-mini", "provider": "openai"}, "column_name": "your-column-identifier", "schema_mapping": schema_mapping}
]

Model and Provider parameters are not mandatory to run Guardrails.

The "schema_mapping" variable needs to be defined first and is a pre-requisite for evaluation runs. Learn how to set this variable here.

PreviousPII Detection RAG Metric NextChat Metrics

Last updated 3 months ago

Was this helpful?