Near Duplicates Detection

The Near Duplicate Detection Test in RagaAI is designed to identify both exact and near duplicates within your image dataset.

Execute Test:

The following code snippet configures and runs a Near Duplicate Detection Test, helping you identify and address duplicate images in your dataset.

Step 1: Define the Duplicate Detection Rules

Start by creating rules that define what counts as a near duplicate in your dataset.

# Assumes LQRules, nearest_duplicate, and an initialised test_session
# are already available from your RagaAI setup.
rules = LQRules()
rules.add(metric="similarity_score", metric_threshold=0.99)

near_duplicates_detection = nearest_duplicate(
    test_session=test_session,
    dataset_name="Enter-your-dataset-name",
    test_name="near_duplicate_detection_1",
    type="near_duplicates",
    output_type="near_duplicates",
    embed_col_name="embedding",
    rules=rules,
)

# Register the configured test with the session, then run it.
test_session.add(near_duplicates_detection)

test_session.run()

  • LQRules(): Initialises the rules for the near duplicate detection.

  • rules.add(): Adds a rule for detecting duplicates:

    • metric: The metric used for detection, "similarity_score" in this instance.

    • metric_threshold: The threshold for the similarity score; a value of 0.99 indicates a very high similarity, typical of near duplicates.

  • nearest_duplicate(): Configures the near duplicate detection test with the following parameters:

    • test_session: The session object linked to your RagaAI project.

    • dataset_name: The name of your dataset; replace "Enter-your-dataset-name" with the actual name.

    • test_name: The name that identifies this test, "near_duplicate_detection_1" in this case.

    • type: The type of test, "near_duplicates" in this case.

    • output_type: The expected result of the test, "near_duplicates" here.

    • embed_col_name: The column name in your dataset containing the embeddings used for comparison.

    • rules: The ruleset you've defined for measuring similarity and detecting duplicates.

  • test_session.add(): Registers the near duplicate detection test within the session.

  • test_session.run(): Initiates the execution of all tests in the session, including the near duplicate detection test.
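
To make the metric_threshold rule more concrete, the sketch below shows how a similarity score between two embeddings could be computed and compared with 0.99. It assumes cosine similarity, which is a common choice for embedding comparison; the source does not state which measure RagaAI uses internally. The function names and the sample vectors are hypothetical and are not part of the RagaAI API.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length embedding vectors.

    Assumption: cosine similarity stands in for "similarity_score" here.
    """
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def is_near_duplicate(a, b, threshold=0.99):
    """True when the similarity score exceeds the rule's threshold."""
    return cosine_similarity(a, b) > threshold


# Hypothetical 3-dimensional embeddings, for illustration only.
original = [0.20, 0.80, 0.55]
resized_copy = [0.21, 0.79, 0.56]   # almost the same direction as `original`
unrelated = [0.90, 0.10, 0.05]      # points in a very different direction

print(is_near_duplicate(original, resized_copy))  # True
print(is_near_duplicate(original, unrelated))     # False
```

A threshold this close to 1.0 flags only embeddings that point in almost exactly the same direction; lowering it would also catch looser matches, at the cost of more false positives.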

By following these steps, you have successfully set up and executed a Near Duplicate Detection Test in RagaAI.

Post-execution, review the results to identify and remove or handle duplicates as necessary.
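
One possible way to handle the duplicates after review is to group images that are linked by near-duplicate pairs and keep a single representative per group. The sketch below is only an illustration: the pair list is made-up data, the helper names are hypothetical, and "keep the first image ID in sorted order" is an arbitrary policy rather than anything RagaAI prescribes.

```python
def group_near_duplicates(duplicate_pairs):
    """Group image IDs connected by near-duplicate pairs (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in duplicate_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for image_id in list(parent):
        groups.setdefault(find(image_id), []).append(image_id)
    return sorted(sorted(members) for members in groups.values())


def select_removals(groups):
    """Keep the first image of each group and return the rest."""
    return [image_id for members in groups for image_id in members[1:]]


# Hypothetical pairs that exceeded the similarity threshold.
duplicate_pairs = [
    ("img_001", "img_002"),
    ("img_002", "img_007"),
    ("img_010", "img_011"),
]

groups = group_near_duplicates(duplicate_pairs)
to_remove = select_removals(groups)

print(groups)     # [['img_001', 'img_002', 'img_007'], ['img_010', 'img_011']]
print(to_remove)  # ['img_002', 'img_007', 'img_011']
```

In practice you may prefer a different rule for choosing the image to keep, for example the highest-resolution copy or the one with the most complete annotations.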

Analysing Test Results

  • Similarity Assessment: The test compares each image's embedding against those of the other images in the dataset and assigns similarity scores.

  • Classification: An image whose similarity score with any other image exceeds the threshold (0.99 in the example above) is classified as 'failed'.
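
The classification rule above can be sketched in a few lines. This is an illustration of the described logic, not RagaAI's implementation: the function name, the image IDs, and the pairwise scores are all hypothetical.

```python
def classify_images(image_ids, pair_scores, threshold=0.99):
    """Mark an image 'failed' if any of its pair scores exceeds the threshold.

    pair_scores maps (image_id_a, image_id_b) to a similarity score.
    """
    status = {image_id: "passed" for image_id in image_ids}
    for (id_a, id_b), score in pair_scores.items():
        if score > threshold:
            # Both members of a near-duplicate pair are flagged.
            status[id_a] = "failed"
            status[id_b] = "failed"
    return status


# Hypothetical similarity scores for four images.
image_ids = ["img_001", "img_002", "img_003", "img_004"]
pair_scores = {
    ("img_001", "img_002"): 0.997,  # above the threshold
    ("img_001", "img_003"): 0.412,
    ("img_001", "img_004"): 0.388,
    ("img_002", "img_003"): 0.405,
    ("img_002", "img_004"): 0.391,
    ("img_003", "img_004"): 0.953,  # similar, but below 0.99
}

status = classify_images(image_ids, pair_scores)
for image_id, result in status.items():
    print(image_id, result)
```

Here img_001 and img_002 fail because their pair score is above 0.99, while img_003 and img_004 pass even though they are fairly similar to each other.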

Visualising Results

  • Embedding View: View your dataset in an interactive visual format to identify clusters of duplicates.

  • Datagrid View: Scan through images and their pass/fail status.

Detailed Review

  • Image View: Click on an image to see its near duplicates and their similarity scores in detail.

By following these steps, you can ensure that your dataset is free from unwanted duplicates, improving the quality and diversity of your image data.
