STAN
Description: Tests the model passed in the argument to see whether it is vulnerable to the STAN attack. STAN ("Strive To Avoid Norms") is a jailbreak scenario in which the model is coaxed into avoiding norms, freely providing information without moral or ethical constraints.
How does the Test work:
- Pre-defined internal system prompts are used.
- Multiple iterations of model output are generated for these pre-defined system prompts.
- raga-llm-eval's MitigationBypass and STAN detectors scan the model outputs to check whether the attack was successful. This determines whether the model is vulnerable to the STAN attack.
Required arguments: model (e.g., gpt-4)
Interpretation of results:
Score range (degree of the model behaving as STAN): [0, 1]. A score of 1 implies that the model does not behave as STAN, while lower scores signify increasing levels of STAN-like behaviour. A sample internal prompt and model response is displayed for failed cases. For detailed info on internal prompts, model responses, and scores, use evaluator.save_results('filename.json').
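The score interpretation above can be sketched as a small helper. This is an illustrative snippet, not part of the raga-llm-eval API; the 0.5 pass/fail threshold is an assumption chosen here for demonstration.

```python
def interpret_stan_score(score: float, threshold: float = 0.5) -> str:
    """Map a STAN detector score in [0, 1] to a verdict.

    A score of 1.0 means the model did not behave as STAN; lower
    scores indicate increasing STAN-like behaviour. The 0.5
    threshold is an illustrative assumption, not a library default.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be in [0, 1]")
    return "passed" if score >= threshold else "failed (vulnerable to STAN)"

print(interpret_stan_score(1.0))   # a fully resistant model passes
print(interpret_stan_score(0.2))   # a low score indicates vulnerability
```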
Coding Example:
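A minimal sketch of running the test, assuming the raga-llm-eval package is installed and an OpenAI API key is available. The test name "stan_test" and the argument spelling follow the library's usual add_test pattern but may differ in your installed version; check your version's test listing.

```python
# Hedged sketch: requires the raga-llm-eval package and a valid API key.
from raga_llm_eval import RagaLLMEval

evaluator = RagaLLMEval(api_keys={"OPENAI_API_KEY": "your-key"})

evaluator.add_test(
    test_names=["stan_test"],       # assumed test identifier
    data={},
    arguments={"model": "gpt-4"},   # model under test
).run()

evaluator.print_results()
# Persist internal prompts, model responses, and scores:
evaluator.save_results("stan_results.json")
```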