A/B Test

A/B Test is a systematic method to compare two or more variations of a pipeline.

A/B Test helps users make informed decisions about pipeline improvements, feature engineering methodologies, and changes to the underlying data processing pipeline. It aids in determining the most effective solutions for improving pipeline performance and their applicability to various scenarios. A/B Test is also useful for assessing the robustness of pipelines to changes in data distribution over time, ensuring adaptability to changing real-world conditions.

Execute Test

The following code snippet performs an A/B Test on a specified dataset within the RagaAI environment.

First, you define the rules that will be used to evaluate the pipelines' performance. These rules are based on metrics such as Difference Percentage.

import datetime

# Assumed imports: the RagaAI testing SDK is expected to expose these names;
# adjust the import to match your installed package.
from raga import *

# Define the rule used to compare the two pipelines.
rules = EventABTestRules()
rules.add(metric="difference_percentage", IoU=0.5, _class="ALL", threshold=50.0, conf_threshold=0.2)

# Restrict the test to datapoints within a timestamp range.
filters = Filter()
filters.add(TimestampFilter(gte="2021-01-01T00:00:00Z", lte="2025-01-15T00:00:00Z"))

# Configure the A/B Test; test_session is assumed to have been created earlier.
model_comparison_check = event_ab_test(test_session=test_session,
                                       dataset_name="bdd_video_test_1",
                                       test_name=f"AB-test-unlabelled-{datetime.datetime.now().strftime('%Y%m%d%H%M%S')}",
                                       modelB="Red_Light_V2",
                                       modelA="Red_Light_V1",
                                       object_detection_modelB="Red_Light_V2",
                                       object_detection_modelA="Red_Light_V1",
                                       type="metadata",
                                       sub_type="unlabelled",
                                       output_type="event_detection",
                                       rules=rules,
                                       aggregation_level=["weather"],
                                       filter=filters)

# Register and execute the test.
test_session.add(model_comparison_check)
test_session.run()

  • EventABTestRules(): Initialises the rules for the A/B Test.

    • rules.add(): Adds a new rule with specific parameters:

      • metric: The performance metric to evaluate (e.g., Difference Percentage).

      • IoU: Sets the required degree of overlap (Intersection over Union) between bounding boxes in consecutive frames, influencing the sensitivity of object tracking.

      • _class: Specifies the class these metrics apply to. "ALL" means all classes.

      • threshold: The minimum acceptable value for the metric.

      • conf_threshold: The minimum confidence value a detection must have to be counted.

  • Filter(): Initialises the filters for the A/B Test.

    • filters.add(): Adds a new filter with specific parameters:

      • TimestampFilter: Specifies the date and time range for the datapoints around which the test results are generated.

  • event_ab_test(): Configures the A/B Test with the following parameters:

    • test_session: The test session created by the user with the project name, access key, secret key and host (a creation sketch follows this list).

    • dataset_name: Specifies the dataset to be used for the test.

    • test_name: A unique name identifying the test run.

    • modelB: The name of Pipeline B (the second pipeline).

    • modelA: The name of Pipeline A (the first pipeline).

    • object_detection_modelB: The intermediate object detection model used to render bounding boxes on the videos for Model B.

    • object_detection_modelA: The intermediate object detection model used to render bounding boxes on the videos for Model A.

    • type: Specifies the level at which the test is run: embedding for cluster level or metadata for metadata level.

    • sub_type: Specifies the category of test (labelled or unlabelled).

    • output_type: Specifies the use case the test is run on, for example object_detection or event_detection.

    • aggregation_level: Specifies the scenarios across which the test results are generated. This is only required when the test is run at the metadata level.

    • rules: The previously defined rules for A/B Test.

    • filter: The previously defined filters for the A/B Test.

  • test_session.add(): Registers the A/B Test with the test session.

  • test_session.run(): Starts the execution of all the tests added to the session, including the A/B Test.
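
The snippet above assumes that test_session already exists. Below is a minimal sketch of creating one, assuming a TestSession constructor that takes the project name, credentials and host described above; the import path, argument names and values are illustrative and may differ in your SDK version.

import os

from raga import TestSession  # assumed import path

# Hypothetical session setup; substitute your own project name, credentials
# and the host URL of your RagaAI deployment.
test_session = TestSession(project_name="video_pipelines",
                           run_name="red_light_ab_test",
                           access_key=os.environ["RAGA_ACCESS_KEY"],
                           secret_key=os.environ["RAGA_SECRET_KEY"],
                           host="https://example.raga.ai")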

Following this guide, you've successfully set up and initiated an A/B Test on the RagaAI Testing Platform.

Analysing Test Results

Navigating and Interpreting Results

  • Directly Look at Problematic Scenarios: Quickly identify where the two pipelines diverge most and pinpoint cases where a pipeline is underperforming.

  • In-Depth Analysis: Dive deeper into specific scenarios or data points to understand the root causes of underperformance.

Data Analysis

  1. Switch to Analysis Tab: To get a detailed performance report, go to the Analysis tab.

  2. View Performance Metrics: Examine metrics like Event A/B Detections and Detections over time.

  3. Data Grid View: Users can drill down into individual datapoints and analyse results at a granular level.

Practical Tips

  • Set Realistic Thresholds: Choose thresholds that reflect the expected performance of your pipelines (see the per-class example after this list).

  • Leverage Visual Tools: Make full use of RagaAI’s visualisation capabilities to gain insights that might not be apparent from raw data alone.
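
For example, rather than one blanket rule, thresholds can be calibrated per class. The sketch below reuses the rules.add() signature from the snippet above; the class name and threshold values are hypothetical, and the assumption that multiple rules can be added to one EventABTestRules() object should be verified against your SDK version.

# Blanket rule for all classes, plus a stricter (hypothetical) rule for a
# class where smaller divergences between pipelines matter more.
rules = EventABTestRules()
rules.add(metric="difference_percentage", IoU=0.5, _class="ALL", threshold=50.0, conf_threshold=0.2)
rules.add(metric="difference_percentage", IoU=0.5, _class="traffic_light", threshold=30.0, conf_threshold=0.2)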

By following these steps, users can efficiently leverage the A/B Test to gain a comprehensive understanding of their pipelines' performance, identify key areas for improvement, and make data-driven decisions to enhance accuracy and reliability.

Refer to the Metric Glossary.