Labelling Quality Test

The Labelling Quality Test highlights data points with a higher probability of labelling errors. By setting a threshold on the provided mistake score metric, you can identify and rectify labelling inaccuracies.

Execute Test:

The following code snippet runs a Labelling Quality Test on a specified dataset to evaluate the quality of its labelling.

# Flag any label whose mistake score exceeds the 0.72 threshold
rules = LQRules()
rules.add(metric="mistake_score", label=["ALL"], metric_threshold=0.72)

# Configure the Labelling Quality Test against the uploaded dataset
edge_case_detection = labelling_quality_test(
    test_session=test_session,
    dataset_name="image_classification_lq_train",
    test_name="Labeling Quality Test",
    type="labelling_consistency",
    output_type="image_classification",
    mistake_score_col_name="MistakeScore",
    embedding_col_name="ImageVectorsM1",
    rules=rules,
)

# Register the test with the session and execute it
test_session.add(edge_case_detection)
test_session.run()
  • LQRules(): Initialises the rules for the Labelling Quality Test, tailored here for image classification.

  • rules.add(): Adds a rule for evaluating the quality of labelling.

    • metric: The performance metric to evaluate; here it is "mistake_score", which measures the likelihood of a labelling error for each data point.

    • label: Specifies the label(s) these metrics apply to. "ALL" means all labels are included in this evaluation.

    • metric_threshold: The maximum acceptable value for the specified metric. Here, a data point passes only if its mistake score is less than or equal to 0.72.

  • labelling_quality_test(): Sets up the labelling quality test for the image classification model.

    • test_session: The active test session with which the test is registered and run.

    • dataset_name: The name of the dataset for the labelling quality test.

    • test_name: A unique identifier for this test.

    • type: Specifies the type of quality test, here "labelling_consistency", focusing on how consistent the labelling is.

    • output_type: Indicates the type of model output being evaluated, which is "image_classification" in this case.

    • mistake_score_col_name: The column name in the dataset that contains the mistake scores for each label.

    • embedding_col_name: The column containing embedding vectors of images, used for analyses that require understanding the semantic space of the images.

    • rules: The set of labelling quality rules defined at the beginning.

  • test_session.add(): Registers the labelling quality test with the session.

  • test_session.run(): Starts the execution of all tests in the session, including your labelling quality test.
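
Before running the test, it can help to sanity-check that the uploaded dataset actually contains the columns the call above refers to. The sketch below is illustrative only: the "ImageId" and "AnnotatedLabel" column names are hypothetical, while "MistakeScore" and "ImageVectorsM1" match the arguments above; the exact upload format depends on your SDK version.

import pandas as pd

# Hypothetical preview of a scored dataset. "ImageId" and "AnnotatedLabel"
# are assumed column names; "MistakeScore" and "ImageVectorsM1" match the
# arguments passed to labelling_quality_test() above.
df = pd.DataFrame({
    "ImageId": ["img_001", "img_002", "img_003"],
    "AnnotatedLabel": ["cat", "dog", "cat"],
    "MistakeScore": [0.12, 0.81, 0.47],
    "ImageVectorsM1": [[0.1, 0.9], [0.4, 0.3], [0.2, 0.7]],
})

# Sanity checks: scores should lie in [0, 1], embeddings share one dimensionality.
assert df["MistakeScore"].between(0, 1).all()
assert df["ImageVectorsM1"].map(len).nunique() == 1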

Analysing Test Results

Understanding Mistake Score

  • Mistake Score Metric: A quantitative measure indicating the likelihood of labelling errors in your dataset.
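
The platform computes the Mistake Score for you, so you never have to derive it yourself. As intuition only (this is not RagaAI's actual formula), a label-error score can be built from a trained model's disagreement with the assigned label:

import numpy as np

def illustrative_mistake_score(probs, assigned_label):
    # probs: the model's softmax probabilities over classes for one image.
    # The score rises as the model disagrees with the annotated label.
    # Illustration only -- RagaAI's Mistake Score is computed by the platform.
    return 1.0 - float(probs[assigned_label])

# A confident "cat" prediction on an image annotated "dog" scores high:
probs = np.array([0.92, 0.05, 0.03])  # classes: [cat, dog, bird]
print(illustrative_mistake_score(probs, assigned_label=1))  # 0.95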

Test Overview

  • Pie Chart Overview: Shows the proportion of labels that passed or failed based on the Mistake Score threshold.

Mistake Score Distribution

  • Bar Graph Visualisation: Displays the class-wise average Mistake Score of failed labels, along with the number of failed data points per class.
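
If you export the scored dataset, both views are straightforward to reproduce locally. A minimal sketch, assuming a pandas DataFrame in which "AnnotatedLabel" is a hypothetical column name for the ground-truth label:

import pandas as pd

THRESHOLD = 0.72  # matches the metric_threshold set in the rule above

# Hypothetical export of the scored dataset.
df = pd.DataFrame({
    "AnnotatedLabel": ["cat", "dog", "cat", "bird", "dog"],
    "MistakeScore": [0.12, 0.81, 0.47, 0.95, 0.30],
})
df["passed"] = df["MistakeScore"] <= THRESHOLD

# Pie-chart view: proportion of labels that passed vs failed.
print(df["passed"].value_counts(normalize=True))

# Bar-graph view: class-wise average Mistake Score of failed labels and
# the number of failed data points per class.
failed = df.loc[~df["passed"]]
print(failed.groupby("AnnotatedLabel")["MistakeScore"].agg(["mean", "count"]))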

Interpreting Results

  • Passed Data Points: Those with a Mistake Score at or below the threshold, indicating accurate labelling.

  • Failed Data Points: Those exceeding the threshold, suggesting potential labelling inaccuracies.

Visualisation and Assessment

  • Visualising Annotations: Arranges images by descending Mistake Score for label assessment.

Image View

  • In-Depth Analysis: Analyse Mistake Scores for each label in an image, with interactive features for viewing annotations and the original image.

  • Information Card: Provides details like Mistake Score, threshold, and confidence score for each label.

By following these steps, you can effectively utilise the Labelling Quality Test to identify and address labelling inaccuracies in your datasets, enhancing the overall quality and reliability of your models.
