# Data Leakage Test

### Execute Test:

The following code snippet sets up a Data Leakage Test, which helps you identify samples in your field dataset that overlap with your training dataset.

**Step 1: Define the Data Leakage Rules**

Start by creating rules that define what counts as an overlapping sample between the two datasets.

```python
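# Assumes `test_session` was already created for your RagaAI project.

# Define the rule: flag samples whose overlap score exceeds the threshold.
rules = DLRules()
rules.add(metric="overlapping_samples", metric_threshold=0.92)

# Names of the datasets to compare.
train_dataset_name = "train_dataset"
field_dataset_name = "val_dataset"

# Configure the data leakage test.
data_leakage = data_leakage_test(test_session=test_session,
                                 test_name="Data-Leakage-Test",
                                 train_dataset_name=train_dataset_name,
                                 dataset_name=field_dataset_name,
                                 type="data_leakage",
                                 output_type="image_data",
                                 train_embed_col_name="imageEmbedding",
                                 embed_col_name="imageEmbedding",
                                 rules=rules)

# Register the test with the session, then run all registered tests.
test_session.add(data_leakage)
test_session.run()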
```

* `DLRules()`: Initialises the rules for the data leakage test.
  * `rules.add()`: Adds a rule for detecting overlapping samples:
    * `metric`: The metric used for detection, "overlapping\_samples" in this instance.
    * `metric_threshold`: The threshold for the overlap score, 0.92 in this example. Samples scoring above it are flagged; a value closer to 1 flags only very close matches.

* `data_leakage_test()`: Configures the data leakage test with the following parameters:
  * `test_session`: The session object linked to your RagaAI project.
  * `test_name`: The name given to this test, "Data-Leakage-Test" here.
  * `train_dataset_name`: Contains the name of your train dataset.
  * `dataset_name`: Contains the name of your field dataset.
  * `type`: The type of test, "data\_leakage" in this case.
  * `output_type`: The output type of the test, "image\_data" here.
  * `train_embed_col_name`: The column name in your training dataset containing the embeddings used for comparison.
  * `embed_col_name`: The column name in your field dataset containing the embeddings used for comparison.
  * `rules`: The previously defined rules for the data leakage test.

* `test_session.add()`: Registers the data leakage test within the session.

* `test_session.run()`: Initiates the execution of all tests in the session, including the data leakage test.

By following these steps, you have successfully set up and executed a Data Leakage Test on the RagaAI Testing Platform.

Post-execution, review the results to identify overlapping samples and remove or otherwise handle them as necessary.

### Analysing Test Results

* **Overlap Assessment**: The test compares each image in the field dataset against the images in the train dataset, assigning overlap scores.
* **Classification**: Images with an overlap score above the threshold with any train image are classified as 'failed'.
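
The platform's internal scoring is not described here, so the following is only an illustrative sketch of the idea behind the two steps above. It assumes the overlap score is the cosine similarity between a field image's embedding and its closest train embedding. The function names, dictionary layout, and toy embeddings are all hypothetical and are not part of the RagaAI API.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def flag_leaked_samples(train_embeddings, field_embeddings, threshold=0.92):
    """Score each field sample against every train sample (assumed scoring scheme).

    A field sample is marked 'failed' when its highest similarity to any
    train sample is strictly above the threshold, otherwise 'passed'.
    """
    results = {}
    for field_id, field_vec in field_embeddings.items():
        best_id, best_score = None, -1.0
        for train_id, train_vec in train_embeddings.items():
            score = cosine_similarity(field_vec, train_vec)
            if score > best_score:
                best_id, best_score = train_id, score
        results[field_id] = {
            "closest_train_id": best_id,
            "overlap_score": best_score,
            "status": "failed" if best_score > threshold else "passed",
        }
    return results


# Toy embeddings, purely for illustration.
train = {"train_0": [1.0, 0.0, 0.0], "train_1": [0.0, 1.0, 0.0]}
field = {"field_0": [0.99, 0.05, 0.0], "field_1": [0.5, 0.5, 0.7]}

report = flag_leaked_samples(train, field, threshold=0.92)
for field_id, row in report.items():
    print(field_id, row["closest_train_id"], round(row["overlap_score"], 3), row["status"])
```

In this toy data, `field_0` points in almost the same direction as `train_0`, so it is flagged, while `field_1` is not close to either train sample. Raising the threshold flags fewer samples; lowering it flags more.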

<figure><img src="https://1811327582-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FYbIiNdp1QbG4avl7VShw%2Fuploads%2FpcZp48Te7AM8BMyYgX30%2FScreenshot%202024-01-16%20at%2015.50.52.png?alt=media&#x26;token=51394e53-d085-4821-8094-d7d39eb6844c" alt=""><figcaption></figcaption></figure>

#### Analysing Results

* **Embedding View**: View your dataset in an interactive visual format to identify clusters of overlapping samples.
* **Datagrid View**: Scan through images and their pass/fail status.

<figure><img src="https://1811327582-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FYbIiNdp1QbG4avl7VShw%2Fuploads%2FiXT9LaIYH4RGByw25b1g%2FScreenshot%202024-01-16%20at%2015.51.39.png?alt=media&#x26;token=64b68810-646a-4969-8f80-1b49d71564ef" alt=""><figcaption></figcaption></figure>

#### Detailed Review

* **Image View**: Click on an image to view the similar datapoints in the train dataset along with their overlap scores.

<figure><img src="https://1811327582-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FYbIiNdp1QbG4avl7VShw%2Fuploads%2FXLdJYev6uTJYBiIyhZtj%2FScreenshot%202024-01-16%20at%2015.52.45.png?alt=media&#x26;token=87655cbc-c3a3-43b2-9d64-258362543db7" alt=""><figcaption></figcaption></figure>

Identifying and handling leaked samples in this way helps safeguard the model's ability to generalise to novel, unseen data while minimising the impact of any leaked information.

