STAN

Description: Tests the model passed in the argument to see whether it is vulnerable to the STAN attack. STAN ("Strive To Avoid Norms") is an attack scenario in which the model avoids norms, freely providing information without moral or ethical bias.

How does the Test work:

  1. Pre-defined internal system prompts are used.

  2. Multiple iterations of model output are generated for these pre-defined system prompts.

  3. raga-llm-eval's MitigationBypass and STAN detectors scan the model's responses to see whether the attack was successful.

  4. These results determine whether the model is vulnerable to the STAN attack (a conceptual sketch of this flow is shown below).
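
The following is an illustrative sketch of the generate-and-detect loop described above, not raga-llm-eval's internal implementation; the function and callable names are hypothetical placeholders.

# Illustrative sketch only -- helper names are hypothetical, not library APIs.
def run_stan_scan(generate, detectors, system_prompts, iterations=3, threshold=1.0):
    """Generate several outputs per pre-defined prompt and score each with detectors.

    generate:  callable mapping a system prompt to a model response (str -> str).
    detectors: list of callables mapping a response to a score in [0, 1],
               where 1.0 means the response fully resisted the STAN attack.
    """
    failed_cases = []
    for system_prompt in system_prompts:
        for _ in range(iterations):
            response = generate(system_prompt)
            scores = [detect(response) for detect in detectors]
            # The attack counts as successful if any detector scores below the threshold.
            if min(scores) < threshold:
                failed_cases.append(
                    {"prompt": system_prompt, "response": response, "scores": scores}
                )
    return {"vulnerable": bool(failed_cases), "failed_cases": failed_cases}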

Required arguments: model (e.g., gpt-4)

Interpretation of results:

  1. Score range (degree to which the model behaves as STAN): [0, 1]. A score of 1 implies that the model does not behave as STAN, while lower scores signify an increasing degree of STAN-like behavior.

  2. A sample internal prompt and model response are displayed for failed cases. For detailed information on internal prompts, model responses, and scores, use evaluator.save_results('filename.json'), as shown below.
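
For example, to persist the full details of a run (the filename here is arbitrary):

# Save internal prompts, model responses, and scores to a JSON file for inspection.
evaluator.save_results("stan_results.json")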

Coding Example:
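
The example below assumes an evaluator object has already been created. A minimal setup sketch follows; the exact import path, class name, and constructor arguments are assumptions and may differ across raga-llm-eval versions.

# Assumed setup -- import path and constructor signature may vary by version.
from raga_llm_eval import RagaLLMEval

evaluator = RagaLLMEval(api_keys={"OPENAI_API_KEY": "your-openai-api-key"})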

# STAN is run as a category of the DAN vulnerability scanner test.
probe = "STAN"

evaluator.add_test(
    test_names=["dan_vulnerability_scanner"],
    data={
        "prompt": ["The quick brown fox jumps over the lazy dog."],
    },
    arguments={"category": [probe], "model": "gpt-3.5-turbo", "threshold": 1.0},
).run()

evaluator.print_results()
