Anonymize

Anonymizes Personally Identifiable Information (PII) in text using NLP (English only) and predefined regex patterns. Detected entities are replaced with placeholders such as [REDACTED_PERSON_1], and the original values are stored in a Vault.
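The placeholder-and-vault idea can be illustrated with a minimal sketch. This is not the guardrail's implementation: the two regex detectors, the `US_SSN` label, and the `anonymize` helper are hypothetical stand-ins, and the vault is modeled as a plain list of (placeholder, original value) pairs.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical, simplified detectors. The real guardrail combines NLP
# with its own predefined regex patterns.
PATTERNS: Dict[str, "re.Pattern[str]"] = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def anonymize(text: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Replace detected values with numbered placeholders.

    Returns the sanitized text and a vault that maps each placeholder
    back to the original value.
    """
    vault: List[Tuple[str, str]] = []
    for entity_type, pattern in PATTERNS.items():
        seen: Dict[str, str] = {}  # original value -> placeholder

        def replace(match: "re.Match[str]") -> str:
            value = match.group(0)
            if value not in seen:
                # Number placeholders per entity type, starting at 1.
                seen[value] = f"[REDACTED_{entity_type}_{len(seen) + 1}]"
                vault.append((seen[value], value))
            return seen[value]

        text = pattern.sub(replace, text)
    return text, vault


sanitized, vault = anonymize(
    "Mail a@b.com or c@d.org, again a@b.com. SSN 111-22-3333."
)
print(sanitized)
print(vault)
```

In this sketch a repeated value reuses its placeholder, so the vault holds one entry per distinct value.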

PII entities

  • Credit Cards: Formats mentioned in Wikipedia.

    • 4111111111111111

    • 378282246310005 (American Express)

    • 30569309025904 (Diners Club)

  • Person: A full person name, which can include first names, middle names or initials, and last names.

    • John Doe

  • PHONE_NUMBER:

    • 5555551234

  • URL: A Uniform Resource Locator (URL), a unique identifier used to locate a resource on the Internet.

    • https://example.com/

  • E-mail Addresses: Standard email formats.

    • john.doe@example.com

    • john.doe[AT]example[DOT]com

    • john.doe[AT]example.com

    • john.doe@example[DOT]com

  • IPs: An Internet Protocol (IP) address (either IPv4 or IPv6).

    • 192.168.1.1 (IPv4)

    • 2001:db8:3333:4444:5555:6666:7777:8888 (IPv6)

  • UUID:

    • 550e8400-e29b-41d4-a716-446655440000

  • US Social Security Number (SSN):

    • 111-22-3333

  • Crypto wallet number: Currently, only Bitcoin addresses are supported.

    • 1Lbcfr7sAHTD9CgdQo3HTMTkV8LK4ZnX71

  • IBAN Code: The International Bank Account Number (IBAN) is an internationally agreed system for identifying bank accounts across national borders. It facilitates cross-border transactions and reduces the risk of transcription errors.

    • DE89370400440532013000
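To make two of the formats above concrete, here is a small sketch of how pattern-based checks for them might look. Both pieces are illustrative assumptions, not the guardrail's actual patterns: `luhn_valid` applies the standard Luhn checksum used by payment card numbers, and `EMAIL_RE` is a hypothetical regex that also accepts the [AT] and [DOT] obfuscations listed above.

```python
import re


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 12:  # too short to be a card number
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


# Hypothetical pattern: plain "@" and "." or their [AT] / [DOT] forms.
EMAIL_RE = re.compile(
    r"[\w.+-]+(?:@|\[AT\])[\w-]+(?:(?:\.|\[DOT\])[\w-]+)+"
)

for card in ("4111111111111111", "378282246310005", "30569309025904"):
    print(card, luhn_valid(card))

for email in (
    "john.doe@example.com",
    "john.doe[AT]example[DOT]com",
    "john.doe[AT]example.com",
    "john.doe@example[DOT]com",
):
    print(email, bool(EMAIL_RE.fullmatch(email)))
```

A checksum step like this helps keep arbitrary digit runs from being treated as card numbers, which a length-only regex cannot do.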

Parameters:

data:

  • prompt (str): The text to be anonymized.

arguments:

  • hidden_names (Optional[Sequence[str]]): List of names to always anonymize. These are replaced with custom placeholders, e.g. [REDACTED_CUSTOM_1].

  • allowed_names (Optional[Sequence[str]]): List of names that may appear in the text without being anonymized.

  • entity_types (Optional[Sequence[str]]): List of entity types to be detected. If not provided, defaults to all.

  • preamble (str): Text to prepend to sanitized prompt. If not provided, defaults to an empty string.

  • regex_patterns (Optional[List[Dict]]): List of regex patterns for additional custom anonymization.

  • use_faker (bool): Whether to replace detected entities with fake values generated by faker instead of placeholders, where applicable. Defaults to False, in which case placeholders such as [REDACTED_PERSON_1] are used.

  • threshold (float): Acceptance threshold. Default is 0.

  • use_onnx (bool): Whether to use ONNX runtime for inference. Default is False.

  • language (str): Language of the text in which entities are detected. Default is "en".
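As a rough illustration, the arguments above could be collected in a dictionary like the one below. The values are examples only: the entity type names are inferred from the placeholders shown on this page ([REDACTED_PERSON_1], [REDACTED_EMAIL_ADDRESS_1], [REDACTED_PHONE_NUMBER_1]), the threshold of 0.5 is arbitrary, and regex_patterns is left out because its dictionary schema is not described here. How the dictionary is passed to the test is also an assumption; presumably it goes alongside data.

```python
from typing import Any, Dict

# Illustrative values only; keys follow the parameter list above.
arguments: Dict[str, Any] = {
    "hidden_names": ["Google LLC"],   # always anonymized
    "allowed_names": ["John Doe"],    # never anonymized
    # Entity type names inferred from the placeholders, not confirmed.
    "entity_types": ["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER"],
    "preamble": "",                   # documented default
    "use_faker": False,               # documented default
    "threshold": 0.5,                 # arbitrary example value
    "use_onnx": False,                # documented default
    "language": "en",                 # documented default
}

print(arguments)
```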

Interpretation:

Personally Identifiable Information (PII) such as names, email addresses, and phone numbers present in the prompt should be removed from the output.

Example:

prompt = "Your name is Walter White, You have received your inquiry regarding an account with Google LLC from John Doe, where you work.
Your registered email address is walter.white@google.com and the phone number is +91-9967854332 associated with the account. Share this information with the complainant so that the person can be assisted further.
Write an email to help the person ahead"

sanitized_prompt = "Your name is [REDACTED_PERSON_1], You have received your inquiry regarding an account with Google LLC from [REDACTED_PERSON_2], where you work.\nYour registered email address is [REDACTED_EMAIL_ADDRESS_1] and the phone number is [REDACTED_PHONE_NUMBER_1] associated with the account. Share this information with the complainant so that the person can be assisted further.\nWrite an email to help the person ahead\n"

Code Example:

# `evaluator` is assumed to be an already-initialized evaluator instance.
evaluator.add_test(
    test_names=["anonymize_guardrail"],
    data={
        "prompt": """Your name is Walter White, You have received your inquiry regarding an account with Google LLC from John Doe, where you work.
Your registered email address is walter.white@google.com and the phone number is +91-9967854332 associated with the account. Share this information with the complainant so that the person can be assisted further.
Write an email to help the person ahead
""",
    },
).run()

evaluator.print_results()
