Data Leakage Test

The Data Leakage Test detects data leakage from the training dataset into the test/validation dataset, that is, data points that are present in both datasets.

Execute Test:

The following code configures and runs the Data Leakage Test:

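# Assumes test_session, run_name, train_dataset_name, and
# field_dataset_name are already defined.

# Define the rule: flag leakage using the overlapping_samples metric
rules = LQRules()
rules.add(metric='overlapping_samples', metric_threshold=0.99)

# Configure the test: compare the field dataset against the training dataset
data_leakage = data_leakage_test(test_session=test_session,
                                 test_name=run_name,
                                 train_dataset_name=train_dataset_name,
                                 dataset_name=field_dataset_name,
                                 type="data_leakage",
                                 output_type="image_data",
                                 train_embed_col_name="Embedding",
                                 embed_col_name="Embedding",
                                 rules=rules)

# Register the test with the session, then run all registered tests
test_session.add(data_leakage)
test_session.run()

  1. Initialize Data Leakage Rules: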

    • Use the LQRules() function to initialize the rules for the test.

  2. Add Rules:

    • Use the rules.add() function to add specific rules with the following parameters:

      • metric: The metric used to detect data leakage (e.g., overlapping_samples).

      • metric_threshold: The threshold for the metric, indicating the degree of overlap required to flag data leakage.

  3. Configure Test Run:

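    • Define the test run configuration, including the project name, test name, and session credentials. The code above assumes this has already been done, so that test_session and run_name are available.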

  4. Execute Data Leakage Test:

    • Use the data_leakage_test() function to execute the test with the following parameters:

      • test_session: The session object managing tests.

      • test_name: Name of the test run.

      • train_dataset_name: Name of the training dataset.

      • dataset_name: Name of the field dataset (test/validation dataset).

      • type: Type of test, which should be set to "data_leakage".

      • output_type: Type of output expected from the model (for example, "image_data").

      • train_embed_col_name: Name of the column containing embeddings in the training dataset.

      • embed_col_name: Name of the column containing embeddings in the field dataset.

      • rules: Predefined rules for the test.

  5. Add Test to Session:

    • Use the test_session.add() function to register the test with the test session.

  6. Run Test:

    • Use the test_session.run() function to start the execution of all tests added to the session, including the Data Leakage Test.

By following these steps, you can effectively detect data leakage from the training dataset to the test/validation dataset using the Data Leakage Test.
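
To make the overlapping_samples rule more concrete, here is a minimal sketch of one way overlap between two sets of embeddings could be measured. It is an illustration only: the use of cosine similarity, the helper names cosine_similarity and find_leaked_samples, and the toy vectors are all assumptions, not the library's documented implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def find_leaked_samples(train_embeddings, field_embeddings, threshold=0.99):
    """Return indices of field samples whose closest training sample
    has a cosine similarity at or above the threshold (hypothetical rule)."""
    leaked = []
    for index, field_vec in enumerate(field_embeddings):
        best = max(cosine_similarity(field_vec, train_vec)
                   for train_vec in train_embeddings)
        if best >= threshold:
            leaked.append(index)
    return leaked


# Toy embeddings: the first and third field vectors point in the same
# direction as a training vector, the second does not.
train_embeddings = [[1.0, 0.0], [0.0, 1.0]]
field_embeddings = [[2.0, 0.0], [1.0, 1.0], [0.0, 3.0]]

leaked = find_leaked_samples(train_embeddings, field_embeddings, threshold=0.99)
print(leaked)
```

A threshold close to 1, such as the 0.99 used in the rule above, would in this sketch flag only near-identical samples; lowering it would also flag samples that are merely similar.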

Interpreting Test Results for Data Leakage Test

Donut Chart

  • The donut chart displays the proportion of leaked and genuine data points detected by the test.

Venn Diagram

  • The Venn diagram illustrates the relationship between the training dataset and the test/validation dataset.

  • The overlap region in the diagram represents the data points that are present in both datasets, indicating potential data leakage.

Together, the donut chart and Venn diagram show the overall extent of data leakage and how the training and test/validation datasets relate.
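
As a rough illustration of the proportions a donut chart like this summarises, the sketch below derives leaked and genuine shares from two counts. The function name leakage_summary and the example numbers are hypothetical; the platform computes and renders these values itself.

```python
def leakage_summary(field_total, leaked_count):
    """Split a field dataset into leaked and genuine shares (hypothetical helper)."""
    if field_total <= 0:
        raise ValueError("field_total must be positive")
    if not 0 <= leaked_count <= field_total:
        raise ValueError("leaked_count must be between 0 and field_total")
    genuine_count = field_total - leaked_count
    return {
        "leaked": leaked_count,
        "genuine": genuine_count,
        "leaked_fraction": leaked_count / field_total,
        "genuine_fraction": genuine_count / field_total,
    }


# Example: 2 of 8 field samples were flagged as leaked.
summary = leakage_summary(field_total=8, leaked_count=2)
print(summary)
```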

Embedding View

  • The Embedding View allows users to analyse patterns and clusters in the leaked data points.

  • It provides insights into the distribution and similarity of embeddings associated with leaked data points.

Data Grid

  • The Data Grid presents detailed information about the leaked data points, including image identifiers, labels, and confidence scores.

  • Users can explore individual data points and assess the characteristics of leaked data.

Image View

  • The Image View allows users to visualise leaked data points alongside their annotations and original images.

  • It provides a detailed examination of individual leaked data points for further analysis.

Together, the Embedding View, Data Grid, and Image View let you explore detailed information and visual representations of leaked data points, so you can understand their characteristics and potential impact on model performance.
