LogoLogo
Slack CommunityCatalyst Login
  • Welcome
  • RagaAI Catalyst
    • User Quickstart
    • Concepts
      • Configure Your API Keys
      • Supported LLMs
        • OpenAI
        • Gemini
        • Azure
        • AWS Bedrock
        • ANTHROPIC
      • Catalyst Access/Secret Keys
      • Enable Custom Gateway
      • Uploading Data
        • Create new project
        • RAG Datset
        • Chat Dataset
          • Prompt Format
        • Logging traces (LlamaIndex, Langchain)
        • Trace Masking Functions
        • Trace Level Metadata
        • Correlating Traces with External IDs
        • Add Dataset
      • Running RagaAI Evals
        • Executing Evaluations
        • Compare Datasets
      • Analysis
      • Embeddings
    • RagaAI Metric Library
      • RAG Metrics
        • Hallucination
        • Faithfulness
        • Response Correctness
        • Response Completeness
        • False Refusal
        • Context Relevancy
        • Context Precision
        • Context Recall
        • PII Detection
        • Toxicity
      • Chat Metrics
        • Agent Quality
        • Instruction Adherence
        • User Chat Quality
      • Text-to-SQL
        • SQL Response Correctness
        • SQL Prompt Ambiguity
        • SQL Context Ambiguity
        • SQL Context Sufficiency
        • SQL Prompt Injection
      • Text Summarization
        • Summary Consistency
        • Summary Relevance
        • Summary Fluency
        • Summary Coherence
        • SummaC
        • QAG Score
        • ROUGE
        • BLEU
        • METEOR
        • BERTScore
      • Information Extraction
        • MINEA
        • Subjective Question Correction
        • Precision@K
        • Chunk Relevance
        • Entity Co-occurrence
        • Fact Entropy
      • Code Generation
        • Functional Correctness
        • ChrF
        • Ruby
        • CodeBLEU
        • Robust Pass@k
        • Robust Drop@k
        • Pass-Ratio@n
      • Marketing Content Evaluation
        • Engagement Score
        • Misattribution
        • Readability
        • Topic Coverage
        • Fabrication
      • Learning Management System
        • Topic Coverage
        • Topic Redundancy
        • Question Redundancy
        • Answer Correctness
        • Source Citability
        • Difficulty Level
      • Additional Metrics
        • Guardrails
          • Anonymize
          • Deanonymize
          • Ban Competitors
          • Ban Substrings
          • Ban Topics
          • Code
          • Invisible Text
          • Language
          • Secret
          • Sentiment
          • Factual Consistency
          • Language Same
          • No Refusal
          • Reading Time
          • Sensitive
          • URL Reachability
          • JSON Verify
        • Vulnerability Scanner
          • Bullying
          • Deadnaming
          • SexualContent
          • Sexualisation
          • SlurUsage
          • Profanity
          • QuackMedicine
          • DAN 11
          • DAN 10
          • DAN 9
          • DAN 8
          • DAN 7
          • DAN 6_2
          • DAN 6_0
          • DUDE
          • STAN
          • DAN_JailBreak
          • AntiDAN
          • ChatGPT_Developer_Mode_v2
          • ChatGPT_Developer_Mode_RANTI
          • ChatGPT_Image_Markdown
          • Ablation_Dan_11_0
          • Anthropomorphisation
      • Guardrails
        • Competitor Check
        • Gibberish Check
        • PII
        • Regex Check
        • Response Evaluator
        • Toxicity
        • Unusual Prompt
        • Ban List
        • Detect Drug
        • Detect Redundancy
        • Detect Secrets
        • Financial Tone Check
        • Has Url
        • HTML Sanitisation
        • Live URL
        • Logic Check
        • Politeness Check
        • Profanity Check
        • Quote Price
        • Restrict Topics
        • SQL Predicates Guard
        • Valid CSV
        • Valid JSON
        • Valid Python
        • Valid Range
        • Valid SQL
        • Valid URL
        • Cosine Similarity
        • Honesty Detection
        • Toxicity Hate Speech
    • Prompt Playground
      • Concepts
      • Single-Prompt Playground
      • Multiple Prompt Playground
      • Run Evaluations
      • Using Prompt Slugs with Python SDK
      • Create with AI using Prompt Wizard
      • Prompt Diff View
    • Synthetic Data Generation
    • Gateway
      • Quickstart
    • Guardrails
      • Quickstart
      • Python SDK
    • RagaAI Whitepapers
      • RagaAI RLEF (RAG LLM Evaluation Framework)
    • Agentic Testing
      • Quickstart
      • Concepts
        • Tracing
          • Langgraph (Agentic Tracing)
          • RagaAI Catalyst Tracing Guide for Azure OpenAI Users
        • Dynamic Tracing
        • Application Workflow
      • Create New Dataset
      • Metrics
        • Hallucination
        • Toxicity
        • Honesty
        • Cosine Similarity
      • Compare Traces
      • Compare Experiments
      • Add metrics locally
    • Custom Metric
    • Auto Prompt Optimization
    • Human Feedback & Annotations
      • Thumbs Up/Down
      • Add Metric Corrections
      • Corrections as Few-Shot Examples
      • Tagging
    • On-Premise Deployment
      • Enterprise Deployment Guide for AWS
      • Enterprise Deployment Guide for Azure
      • Evaluation Deployment Guide
        • Evaluation Maintenance Guide
    • Fine Tuning (OpenAI)
    • Integration
    • SDK Release Notes
      • ragaai-catalyst 2.1.7
  • RagaAI Prism
    • Quickstart
    • Sandbox Guide
      • Object Detection
      • LLM Summarization
      • Semantic Segmentation
      • Tabular Data
      • Super Resolution
      • OCR
      • Image Classification
      • Event Detection
    • Test Inventory
      • Object Detection
        • Failure Mode Analysis
        • Model Comparison Test
        • Drift Detection
        • Outlier Detection
        • Data Leakage Test
        • Labelling Quality Test
        • Scenario Imbalance
        • Class Imbalance
        • Active Learning
        • Image Property Drift Detection
      • Large Language Model (LLM)
        • Failure Mode Analysis
      • Semantic Segmentation
        • Failure Mode Analysis
        • Labelling Quality Test
        • Active Learning
        • Drift Detection
        • Class Imbalance
        • Scenario Imbalance
        • Data Leakage Test
        • Outlier Detection
        • Label Drift
        • Semantic Similarity
        • Near Duplicates Detection
        • Cluster Imbalance Test
        • Image Property Drift Detection
        • Spatio-Temporal Drift Detection
        • Spatio-Temporal Failure Mode Analysis
      • Tabular Data
        • Failure Mode Analysis
      • Instance Segmentation
        • Failure Mode Analysis
        • Labelling Quality Test
        • Drift Detection
        • Class Imbalance
        • Scenario Imbalance
        • Label Drift
        • Data Leakage Test
        • Outlier Detection
        • Active Learning
        • Near Duplicates Detection
      • Super Resolution
        • Semantic Similarity
        • Active Learning
        • Near Duplicates Detection
        • Outlier Detection
      • OCR
        • Missing Value Test
        • Outlier Detection
      • Image Classification
        • Failure Mode Analysis
        • Labelling Quality Test
        • Class Imbalance
        • Drift Detection
        • Near Duplicates Test
        • Data Leakage Test
        • Outlier Detection
        • Active Learning
        • Image Property Drift Detection
      • Event Detection
        • Failure Mode Analysis
        • A/B Test
    • Metric Glossary
    • Upload custom model
    • Event Detection
      • Upload Model
      • Generate Inference
      • Run tests
    • On-Premise Deployment
      • Enterprise Deployment Guide for AWS
      • Enterprise Deployment Guide for Azure
  • Support
Powered by GitBook
On this page
  • Execute Test
  • Analysing Test Results

Was this helpful?

  1. RagaAI Prism
  2. Test Inventory
  3. Image Classification

Failure Mode Analysis

Failure Mode Analysis enables you to define rules on metrics by setting thresholds. Based on rules, get a sorted list of all clusters that are breaching the threshold and clusters that are within the threshold. RagaAI will create clusters using the embeddings of the Dataset. Clusters that underperform and breaches the threshold will be highlighted on the Issue Stats screen.

Execute Test

The following code snippet is designed to perform a Failure Mode Analysis test on Language Models (LLMs) within the RagaAI environment.

First, you define the rules that will be used to evaluate the Image Classification model performance based on various metrics such as Precision, Recall and F1Score. These rules help in assessing the model's performance

This code snippet is for executing a Failure Mode Analysis (FMA) Test for an Image Classification model. Let's break down each part of the code:

rules = FMARules()
rules.add(metric="Precision", conf_threshold=0.5, metric_threshold=0.5, label="live")

cls_default = clustering(test_session=test_session,
                         dataset_name = "phase2-live-dataset",
                         method="k-means",
                         embedding_col="ImageVectorsM1",
                         level="image",
                         args= {"numOfClusters": 5})

edge_case_detection = failure_mode_analysis(test_session=test_session,
                                            dataset_name = "ic_fma_testdata",
                                            test_name = "Failure Mode Analysis",
                                            model = "modelB",
                                            gt = "GT",
                                            type="embeddings",
                                            clustering = cls_default,
                                            rules = rules,
                                            output_type="multi-label")

test_session.add(edge_case_detection)
test_session.run()
  • rules = FMARules():Initialises the rules for the FMA test specifically tailored for image classification.

    • Adds a rule for evaluating the model's performance.

    • metric: Specifies the performance metric to evaluate, here it is "Precision".

    • conf_threshold: The confidence threshold for predictions. Predictions with a confidence score below this threshold are likely considered invalid or uncertain.

    • metric_threshold: The minimum acceptable value for the specified metric. Here, the precision must be at least 0.5.

    • label: Specifies the label to which this rule applies. In this case, the rule is specific to the "live" label in a multi-label classification scenario.

  • clustering(): Configures the clustering parameters for grouping similar failure modes.

    • dataset_name: The name of the dataset to be used for the clustering process.

    • method: The clustering technique used; here it is "k-means".

    • embedding_col: The column containing embedding vectors of images to be used for clustering.

    • level: Specifies the level at which clustering is applied; "image" indicates that each image is treated as an individual data point.

    • args: Additional arguments for the clustering method, like the number of clusters to form.

  • failure_mode_analysis()

    • Sets up the failure mode analysis for the image classification model.

    • dataset_name: The name of the dataset for the FMA test.

    • test_name: A unique identifier for this test.

    • model: The identifier of the image classification model being tested, here "modelB".

    • gt: Refers to "ground truth", enter the ground truth in schema mapping

    • type: The type of analysis, here it's based on "embeddings".

    • clustering: The clustering configuration defined earlier.

    • rules: The set of evaluation rules defined at the beginning.

    • output_type: Indicates the type of classification, which is "multi-label" in this case.

  • test_session.add(): Registers the labelling quality test with the session.

  • test_session.run(): Starts the execution of all tests in the session, including your labelling quality test.

This setup in RagaAI helps you effectively analyse and identify failure modes in your Image Classification Model, providing valuable insights for model improvement and refinement."

Analysing Test Results

Clustering (Embedding View)

  1. Navigate to Embedding View: Find the interactive visualisation interface in RagaAI.

  2. Visualise Clusters: Utilise the tool to visualise complex clustering, uncovering hidden patterns and structures in the data.

  3. Select Data Points: Use the lasso tool to select specific data points of interest and analyse their results.

Visualising Data

  1. Grid View: Access the grid view to see data points within the selected clusters.

  2. Data Filtering: Use this feature to focus on specific subsets of your dataset that meet certain conditions, helping to extract meaningful patterns and trends.

Understanding Clustering and Threshold Breaches

  • Cluster Analysis: RagaAI creates clusters using the embeddings of the dataset. These clusters group similar data points together.

  • Identifying Underperforming Clusters: Clusters that underperform (breaching the threshold) will be highlighted on the Issue Stats screen.

Navigating and Interpreting Results

  • Directly Look at Problematic Clusters: Users can quickly identify clusters responsible for underperformance and assess their impact on the overall model.

  • In-Depth Analysis: Dive deeper into specific clusters or data points to understand the root causes of underperformance.

Data Analysis

  1. Switch to Analysis Tab: To get a detailed performance report, go to the Analysis tab.

  2. View Performance Metrics: Examine metrics like label-wise performance and temporal graphs.

  3. Confusion Matrix: The class-based confusion matrix in Failure Mode Analysis provides a detailed breakdown of performance for each class. Users can view the confusion matrix in three ways:

    • Absolute: Shows the absolute number of pixels.

    • Normalised (Ground Truth): Normalises values with respect to the ground truth.

    • Normalised (Model): Normalises values with respect to model inference.

Practical Tips

  • Set Realistic Thresholds: Choose thresholds that reflect the expected performance of your model.

  • Leverage Visual Tools: Make full use of RagaAI’s visualisation capabilities to gain insights that might not be apparent from raw data alone.

By following these steps, users can efficiently leverage the Failure Mode Analysis test to gain a comprehensive understanding of their model's performance, identify key areas for improvement, and make data-driven decisions to enhance model accuracy and reliability.

PreviousImage ClassificationNextLabelling Quality Test

Last updated 1 year ago

Was this helpful?