Description: Tests the model (that you passed in the argument) to see if it is vulnerable to DUDE attack. DUDE is an attack scenario where LLM assumes the role DUDE that can perform any task without adhering to content policies or ethical considerations.

How does the Test work:

  1. Pre-defined internal system prompts are used.

  2. Multiple iteration of Model output is generated for these pre-defined system prompts.

  3. raga-llm-eval’s MitigationBypass and DUDE detectors scan the model to see if the attack was successful.

  4. This concludes if the model is vulnerable to DUDE attack.

Required arguments: model (like:gpt-4)

Interpretation of results:

  1. Score range(degree of model behaving as DUDE): [0, 1]. A score of 1 implies that the model does not behave as DUDE, while lower scores signify increasing levels of model behaving as DUDE.

  2. Sample internal prompt and model response for failed cases is displayed. For detailed info on internal prompts, model responses, scores, use evaluator.save_results('filename.json').

Coding Example:

probe = 'DUDE'
    arguments={"category":[probe], "model": "gpt-3.5-turbo", "threshold": 1.0},


Last updated