Data Leakage Test
The Data Leakage Test in RagaAI is designed to identify both exact and near duplicates within your image dataset.
Execute Test:
The steps below set up a Near Duplicate Detection Test, which helps you identify and address duplicate images in your dataset.
Step 1: Define the Duplication Detection Rules
Start by creating rules to identify what constitutes a near duplicate in your dataset.
DLRules(): Initialises the rules for the data leakage test.
rules.add(): Adds a rule for detecting duplicates:
  metric: The performance metric used for detection, "similarity_score" in this instance.
  metric_threshold: The threshold for the similarity score; a value of 0.99 indicates a very high similarity, typical of near duplicates.
data_leakage_test(): Configures the near duplicate detection test with the following parameters:
  test_session: The session object linked to your RagaAI project.
  train_dataset_name: Contains the name of your train dataset.
  dataset_name: Contains the name of your field dataset.
  type: The type of test, "near_duplicates" in this case.
  output_type: The expected result of the test, "near_duplicates" here.
  train_embed_col_name: The column name in your training dataset containing the embeddings used for comparison.
  embed_col_name: The column name in your field dataset containing the embeddings used for comparison.
  rules: The previously defined rules for data leakage test.
test_session.add(): Registers the near duplicate detection test within the session.
test_session.run(): Initiates the execution of all tests in the session, including the near duplicate detection test.
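The calls described above can be strung together roughly as follows. This is a minimal sketch, not the official snippet: the function name run_near_duplicate_test is hypothetical, the keyword-argument call style is an assumption, and the real library may require further arguments that are not listed here. Because this section does not show where DLRules and data_leakage_test are imported from, the sketch takes them as arguments instead of guessing an import path.

```python
def run_near_duplicate_test(test_session, DLRules, data_leakage_test,
                            train_dataset_name, dataset_name,
                            train_embed_col_name, embed_col_name):
    """Wire up and run the near duplicate detection test described above.

    DLRules and data_leakage_test are passed in rather than imported,
    because the import path is not shown in this section (assumption).
    """
    # Define what counts as a near duplicate: similarity score threshold 0.99.
    rules = DLRules()
    rules.add(metric="similarity_score", metric_threshold=0.99)

    # Configure the test with the parameters listed above.
    test = data_leakage_test(
        test_session=test_session,
        train_dataset_name=train_dataset_name,
        dataset_name=dataset_name,
        type="near_duplicates",
        output_type="near_duplicates",
        train_embed_col_name=train_embed_col_name,
        embed_col_name=embed_col_name,
        rules=rules,
    )

    # Register the test, then execute every test in the session.
    # Passing the configured test object to add() is an assumption.
    test_session.add(test)
    test_session.run()
    return test
```

To use it, pass in your existing session object, the DLRules and data_leakage_test objects from your RagaAI installation, your own dataset names, and the names of the embedding columns in each dataset.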
By following these steps, you have successfully set up and executed a Data Leakage Test on the RagaAI Testing Platform.
Post-execution, review the results to identify and remove or handle duplicates as necessary.
Analysing Test Results
Overlap Assessment: The test evaluates each image against others, assigning overlap scores.
Classification: Images whose overlap score with any other image exceeds the threshold are classified as 'failed'.
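The scoring and classification steps can be illustrated in plain Python. This is only a sketch of the idea, not the platform's implementation: it assumes the overlap score is the cosine similarity between embedding vectors, and the function names, image IDs, sample vectors, and the 'passed' label are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify_field_images(train_embeddings, field_embeddings, threshold=0.99):
    """Mark a field image 'failed' if its best overlap score with any
    train image is above the threshold, otherwise 'passed'."""
    results = {}
    for field_id, field_vec in field_embeddings.items():
        best_id, best_score = None, -1.0
        for train_id, train_vec in train_embeddings.items():
            score = cosine_similarity(field_vec, train_vec)
            if score > best_score:
                best_id, best_score = train_id, score
        results[field_id] = {
            "status": "failed" if best_score > threshold else "passed",
            "closest_train_image": best_id,
            "overlap_score": best_score,
        }
    return results


# Toy embeddings: field_001 is almost identical to train_001.
train = {"train_001": [1.0, 0.0, 0.0], "train_002": [0.0, 1.0, 0.0]}
field = {"field_001": [0.999, 0.01, 0.0], "field_002": [0.5, 0.5, 0.7]}

report = classify_field_images(train, field, threshold=0.99)
for image_id, row in report.items():
    print(image_id, row["status"], row["closest_train_image"])
```

Here field_001 is flagged as 'failed' against train_001, while field_002 stays below the threshold against every train image. The brute-force pairwise loop keeps the sketch short; it is not meant for large datasets.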
Analysing Results
Embedding View: View your dataset in an interactive visual format to identify clusters of duplicates.
Datagrid View: Scan through images and their pass/fail status.
Detailed Review
Image View: Click on an image and view the similar datapoints in the train dataset along with the overlap scores.
Removing or otherwise handling these overlaps before training and evaluation helps the model generalize to novel, unseen data and keeps leaked samples from inflating its measured performance.