# Data Leakage Test

### Execute Test:

The following code snippet is set up to perform a Near Duplicate Detection Test, helping you to identify and address duplicate images in your dataset.

**Step 1: Define the Duplication Detection Rules**

Start by creating rules to identify what constitutes a near duplicate in your dataset.

```python
rules = DLRules()
rules.add(metric = 'overlapping_samples', metric_threshold = 0.92)


train_dataset_name = "train_dataset"
field_dataset_name = "val_dataset"


edge_case_detection = data_leakage_test(test_session=test_session,
                                           test_name="Data-Leakage-Test",
                                           train_dataset_name=train_dataset_name,
                                           dataset_name=field_dataset_name,
                                           type = "data_leakage",
                                           output_type="image_data",
                                           train_embed_col_name="imageEmbedding",
                                           embed_col_name = "imageEmbedding",
                                           rules = rules)

test_session.add(edge_case_detection)

test_session.run()

```

* `DLRules()`: Initialises the rules for the data leakage test.
  * `rules.add()`: Adds a rule for detecting duplicates:
    * `metric`: The performance metric used for detection, "similarity\_score" in this instance.
    * `metric_threshold`: The threshold for the similarity score; a value of 0.99 indicates a very high similarity, typical of near duplicates.

* `data_leakage_test()`: Configures the near duplicate detection test with the following parameters:
  * `test_session`: The session object linked to your RagaAI project.
  * `train_dataset_name`: Contains the name of your train dataset.
  * `dataset_name`: Contains the name of your field dataset.
  * `type`: The type of test, "near\_duplicates" in this case.
  * `output_type`: The expected result of the test, "near\_duplicates" here.
  * `train_embed_col_name`: The column name in your training dataset containing the embeddings used for comparison.
  * `embed_col_name`: The column name in your field dataset containing the embeddings used for comparison.
  * `rules`: The previously defined rules for data leakage test.

* `test_session.add()`: Registers the near duplicate detection test within the session.

* `test_session.run()`: Initiates the execution of all tests in the session, including the near duplicate detection test.

By following these steps, you have successfully set up and executed a Data Leakage Test on the RagaAI Testing Platform.

Post-execution, review the results to identify and remove or handle duplicates as necessary.

### Analysing Test Results

* **Overlap Assessment**: The test evaluates each image against others, assigning overlap scores.
* **Classification**: Images with a overlap score above the threshold to any other image are classified as 'failed'.

<figure><img src="/files/eW8zVza5LK62pamEkFvV" alt=""><figcaption></figcaption></figure>

#### Analysing Results

* **Embedding View**: View your dataset in an interactive visual format to identify clusters of duplicates.
* **Datagrid View**: Scan through images and their pass/fail status.

<figure><img src="/files/HtxCv2VJRgQPDCQGBb6t" alt=""><figcaption></figcaption></figure>

#### Detailed Review

* **Image View**: Click on an image and view the similar datapoints in the train dataset along with the overlap scores.&#x20;

<figure><img src="/files/2bdh0mTsn4334P9RJHve" alt=""><figcaption></figcaption></figure>

This proactive approach safeguards the model's resilience, enabling it to consistently generalize to novel, unseen data while minimizing the impact of any leaked information.

###


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.raga.ai/ragaai-prism/test-inventory/object-detection/data-leakage-test.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
