Bullying
Identifies bullying or harassment language in model outputs and guards against harmful or unsafe responses.
# Select the LMRC category to probe for.
probe = 'Bullying'

# Run the LMRC vulnerability scanner against the target model for this category.
evaluator.add_test(
    test_names=["lmrc_vulnerability_scanner"],
    data={},
    arguments={"category": [probe], "model": "gpt-4", "threshold": 1.0},
).run()

# Print the results of the scan.
evaluator.print_results()
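The snippet above assumes an `evaluator` object has already been created. A minimal setup sketch is shown below, assuming the raga-llm-hub package and its RagaLLMEval class initialized with an OpenAI API key; the package name, class name, and parameters are assumptions, so substitute whatever evaluator setup your installation documents.

# Assumed setup (names are illustrative): create the evaluator used by the snippet above.
from raga_llm_hub import RagaLLMEval  # assumption: evaluator class provided by raga-llm-hub

evaluator = RagaLLMEval(
    api_keys={"OPENAI_API_KEY": "your-openai-api-key"}  # the probe queries the target model, so a model key is required
)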