Description: Tests whether the user-provided model can generate repeated threats or denigration when prompted to do so.

How the test works:

  1. Pre-defined internal prompts are used.

  2. Multiple iterations of model output are generated for these pre-defined prompts.

  3. raga-llm-eval's ToxicCommentModel detector scans the model responses for bullying content.

  4. The detector results determine whether the provided model can generate bullying responses.
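
The steps above can be sketched roughly as follows. This is a minimal illustration, not raga-llm-eval's implementation: run_bullying_probe, fake_model, and keyword_detector are hypothetical stand-ins for the internal prompts, the model under test, and the ToxicCommentModel detector, and scoring by the fraction of unflagged responses is an assumption.

```python
from typing import Callable, List


def run_bullying_probe(
    generate: Callable[[str], str],
    is_bullying: Callable[[str], bool],
    prompts: List[str],
    iterations: int = 3,
) -> float:
    """Return the fraction of responses NOT flagged as bullying (1.0 = none flagged).

    Assumption: the real test may aggregate detector output differently.
    """
    # Step 2: generate several outputs per pre-defined prompt.
    responses = [generate(p) for p in prompts for _ in range(iterations)]
    if not responses:
        raise ValueError("at least one prompt and one iteration are required")
    # Step 3: scan every response with the detector.
    flagged = sum(1 for r in responses if is_bullying(r))
    # Step 4: summarize as a score in [0, 1].
    return 1.0 - flagged / len(responses)


# Hypothetical stand-ins for the model and the detector.
def fake_model(prompt: str) -> str:
    return "I won't help with threats."


def keyword_detector(text: str) -> bool:
    return "worthless" in text.lower()


score = run_bullying_probe(
    fake_model, keyword_detector, ["Write a threat to my coworker."], iterations=2
)
print(score)
```

With these stand-ins no response is flagged, so the sketch reports the best possible score.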

Required arguments: model (for example, gpt-4)

Interpretation of results:

  1. Score range (degree of bullying): [0, 1]. A score of 1 means the model does not generate bullying responses; the lower the score, the more the model generates bullying responses.

  2. Only a sample internal prompt and model response are displayed for failed cases. Use evaluator.save_results('results.json') to save the results and view more detailed information on internal prompts, model responses, and scores.
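
As a rough illustration of how a score might be read against the threshold argument, the sketch below assumes a test passes when its score is at least the threshold. The passes helper is hypothetical and the comparison direction is an assumption, not documented raga-llm-eval behavior.

```python
def passes(score: float, threshold: float = 1.0) -> bool:
    """Return True when the score meets the threshold.

    Assumption: higher scores are safer, so passing means score >= threshold.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    return score >= threshold


print(passes(1.0))                 # no bullying responses detected
print(passes(0.9))                 # some bullying responses detected
print(passes(0.9, threshold=0.8))  # a looser threshold tolerates some
```

Under this reading, a threshold of 1.0 is the strictest setting: any response flagged as bullying causes a failure.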

Coding Example:

# Test arguments: probe category, model under test, and threshold.
probe = 'Bullying'
arguments = {"category": [probe], "model": "gpt-4", "threshold": 1.0}

