Cluster Imbalance Test

The Cluster Imbalance Test in RagaAI is designed to assess the distribution of data points within clusters to identify imbalances and address bias.

Execute Test:

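# Assumes SBRules, clustering, and cluster_imbalance are already imported from the
# RagaAI client, and that test_session and dataset_name were defined earlier.

# Rules: compare the cluster-size distribution against a uniform distribution.
rules = SBRules()
rules.add(metric="js_divergence", ideal_distribution="uniform", metric_threshold=0.1)
rules.add(metric="chi_squared_test", ideal_distribution="uniform", metric_threshold=0.1)

# Clustering configuration: k-means over the image embeddings with 8 clusters.
cls_default = clustering(test_session=test_session,
                         dataset_name=dataset_name,
                         method="k-means",
                         embedding_col="ImageEmbedding",
                         level="image",
                         args={"numOfClusters": 8}
                         )

# Test definition: ties the rules and the clustering configuration together.
edge_case_detection = cluster_imbalance(test_session=test_session,
                                        dataset_name=dataset_name,
                                        test_name="Cluster_Imbalance",
                                        type="cluster_imbalance",
                                        output_type="cluster",
                                        embedding="ImageEmbedding",
                                        rules=rules,
                                        clustering=cls_default
                                        )

# Register the test with the session, then run every registered test.
test_session.add(edge_case_detection)
test_session.run()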

Initialize Cluster Imbalance Rules:

  • Use the SBRules() function to initialize the rules for the test.

Add Rules:

  • Use the rules.add() function to add specific rules with the following parameters:

    • metric: The metric used to evaluate distribution across clusters (e.g., js_divergence, chi_squared_test).

    • ideal_distribution: The reference distribution that the observed cluster distribution is compared against (e.g., "uniform").

    • metric_threshold: The threshold at which the cluster distribution is considered imbalanced.
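To make the two metrics concrete, the following is a minimal sketch of how Jensen-Shannon divergence and a chi-squared statistic can be computed for cluster sizes against a uniform reference. It is not RagaAI's implementation: the helper names and the cluster sizes are hypothetical, and how the platform compares metric_threshold to each metric (for example, against the statistic or a p-value) is not stated in this page.

```python
import math


def to_distribution(counts):
    """Normalize raw cluster sizes into a probability distribution."""
    total = sum(counts)
    if total == 0:
        raise ValueError("at least one data point is required")
    return [c / total for c in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits, bounded between 0 and 1."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def chi_squared_statistic(counts):
    """Pearson chi-squared statistic against equal expected cluster sizes."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)


# Hypothetical sizes for 8 clusters; two clusters dominate the dataset.
cluster_sizes = [400, 350, 50, 40, 60, 45, 30, 25]
observed = to_distribution(cluster_sizes)
uniform = [1 / len(cluster_sizes)] * len(cluster_sizes)

jsd = js_divergence(observed, uniform)
chi2 = chi_squared_statistic(cluster_sizes)
print(f"JS divergence: {jsd:.4f}, chi-squared statistic: {chi2:.1f}")
```

Both values are 0 for perfectly even clusters and grow as the sizes drift away from uniform, which is the behavior a threshold such as 0.1 is meant to catch.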

Configure Clustering:

  • Perform clustering on the dataset to group similar data points together using the desired method and parameters.

  • Use the clustering() function with parameters such as method (e.g., "k-means"), embedding_col, level, and args (e.g., {"numOfClusters": 8}).
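For readers unfamiliar with k-means, the sketch below shows the idea behind this step: embeddings are assigned to the nearest of k centroids, and the resulting cluster sizes are what an imbalance test inspects. This is plain Lloyd's algorithm written for illustration, not the platform's clustering() internals. The function names and the 2-D embeddings are hypothetical, and treating numOfClusters as k is an assumption.

```python
import random
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=20, seed=0):
    """Return one cluster label per point using Lloyd's algorithm."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


# Hypothetical 2-D embeddings: one large group and one small group.
rng = random.Random(42)
embeddings = (
    [[rng.gauss(0, 0.5), rng.gauss(0, 0.5)] for _ in range(90)]
    + [[rng.gauss(10, 0.5), rng.gauss(10, 0.5)] for _ in range(10)]
)

labels = k_means(embeddings, k=2)
cluster_sizes = Counter(labels)
print(dict(cluster_sizes))
```

The cluster sizes produced this way are the input to the distribution metrics configured in the rules.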

Define Test:

  • Use the cluster_imbalance() function to define the test with the following parameters:

    • test_session: The session object managing tests.

    • dataset_name: Name of the dataset to be tested.

    • test_name: Name of the test run.

    • type: Type of test, which should be set to "cluster_imbalance".

    • output_type: Type of output expected (e.g., "cluster").

    • embedding: Name of the column containing embedding vectors.

    • rules: Predefined rules for the test.

    • clustering: The clustering configuration returned by the clustering() function.

Add Test to Session:

  • Use the test_session.add() function to register the test within the session.

Run Test:

  • Use the test_session.run() function to execute all tests added to the session, including the Cluster Imbalance Test.

Following these steps sets up and executes a Cluster Imbalance Test on the RagaAI Testing Platform.

Post-execution, review the results to identify imbalanced clusters and rebalance or otherwise handle them as necessary.

Analysing Test Results:

The Cluster Imbalance Test shows how evenly data points are distributed across clusters within the dataset. The results help identify significant imbalances that could bias the model's outcomes. The platform presents the results in three views: Embedding View, Data Grid View, and Image View.

Understanding Clustering:

  • Cluster Analysis: RagaAI utilizes clustering to group similar data points together, leveraging embeddings from the dataset.

  • Identifying Imbalanced Distribution: Clusters with high imbalance scores indicate that data points are unevenly distributed, which can lead to biased results.
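As a rough illustration of spotting over- and under-represented clusters, the sketch below compares each cluster's share of the data with the share it would have under a uniform distribution. The ratio used here is a hypothetical score chosen for illustration; this page does not define how RagaAI computes its imbalance scores, and the labels and cut-offs are made up.

```python
from collections import Counter


def cluster_share_ratios(labels):
    """Map each cluster to (its share of the data) / (uniform share).

    A ratio of 1.0 means the cluster holds exactly its even share.
    Clusters with no members never appear in labels and are not counted.
    """
    counts = Counter(labels)
    uniform_share = 1 / len(counts)
    total = len(labels)
    return {
        cluster: (count / total) / uniform_share
        for cluster, count in counts.items()
    }


# Hypothetical cluster assignments for 100 data points in 3 clusters.
labels = [0] * 70 + [1] * 20 + [2] * 10
ratios = cluster_share_ratios(labels)

# Hypothetical cut-offs: flag clusters far above or below an even share.
over_represented = [c for c, r in sorted(ratios.items()) if r > 1.5]
under_represented = [c for c, r in sorted(ratios.items()) if r < 0.5]
print(over_represented, under_represented)
```

Clusters flagged this way are natural starting points when browsing the result views.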

Interpreting Results:

  • Embedding View: Use this interactive feature to visualize how data points are distributed among clusters.

  • Data Grid View: Visualize images and their annotations, sorted by cluster imbalance score.

  • Image View: Explore in-depth analyses for each image.

Used this way, the Cluster Imbalance Test helps verify that data points are distributed evenly across clusters, which supports the integrity and fairness of machine learning models trained on the dataset.
