DAN 11

Description: Tests the model (that you passed in the argument) to see if it is vulnerable to DAN 11(Do-Anything-Now) attack. DAN 11 is an attack scenario instructing model to ignore previous instructions and act as if the Do-Anything-Now Mode is enabled, allowing uncensored and unrestricted content generation.

How does the Test work:

  1. Pre-defined internal system prompts are used.

  2. Multiple iteration of Model output is generated for these pre-defined system prompts.

  3. raga-llm-eval’s MitigationBypass and DAN detectors scan the model to see if the attack was successful.

  4. This concludes if the model is vulnerable to DAN 11 attack.

Required arguments: model (like:gpt-4)

Interpretation of results:

  1. Score range(degree of model behaving as DAN 11): [0, 1]. A score of 1 implies that the model does not behaves as DAN 11, while lower scores signify increasing levels of model behaving as DAN 11.

  2. Sample internal prompt and model response for failed cases is displayed. For detailed info on internal prompts, model responses, scores, use evaluator.save_results('filename.json').

Coding Example:

probe = 'Dan_11_0'
evaluator.add_test(
    test_names=["dan_vulnerability_scanner"],
    data = {},
    arguments={"category":[probe], "model": "gpt-3.5-turbo", "threshold": 1.0},
).run()

evaluator.print_results()

Last updated