
Custom Metric


Custom Metrics: Multi-Step Evaluation

This functionality allows you to create multi-step evaluation pipelines, enabling complex workflows that use Large Language Models (LLMs) and/or Python-based scripting. Each metric can be composed of multiple steps, and each step's output is available to subsequent steps. By chaining steps together, you can design flexible and robust evaluation metrics for your unique use cases.

You can find a dedicated Custom Metrics tab in the main navigation menu of the RagaAI Catalyst platform. Here, you can:

  • View all custom metrics you have created (or have access to).

  • Create new metrics from scratch.

  • Edit existing metrics—add, remove, or modify steps.

  • Manage them through version control and deployments.


Creating a New Metric

Initiation:

  1. Click on Create New Metric.

  2. Provide a Metric Name (required, must be unique).

  3. Add a Description (optional, up to 30 characters).

Adding Steps

After naming and describing the metric:

  1. Click Add Step.

  2. Choose the Step Type:

    1. Custom Metric (LLM Call)

    2. Python

You can add as many steps as you need. Each step can reference the outputs of prior steps, giving you virtually unlimited flexibility in how you design your metric.


Custom Metric (LLM Call)

When you add a Custom Metric (LLM Call) step, you will see the following configuration sections:

  1. Prompt Editor

    • Insert variables using the syntax {{variable_name}} (an illustrative prompt template appears after the note below).

    • You can configure the System Role and User Role instructions for the LLM.

  2. Model Configuration

    • Select from available LLMs (e.g., GPT-4).

    • Set parameters like Max Tokens, Temperature, etc.

    • Any output from previous steps can be referenced using {{step_name.response}}.

  3. Reasoning

    • Reasoning Checkbox: If you want the LLM’s reasoning for how it arrived at its score or response, check the Reasoning checkbox. This ensures the LLM call also returns a reasoning string in addition to the primary output.

    • Reasoning Dropdown: In the main evaluation configuration (outside the step editor), you can select which step’s reasoning will be displayed when running Evals. This gives you the flexibility to see the most relevant step’s reasoning during your final evaluation review.

Note: Enabling reasoning will store an additional field from the same LLM call. You can use or display this reasoning later for review, debugging, or transparency into how the LLM arrived at its final answer.
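
For instance, a User Role prompt for a claims-verification step might look like the template below. This is only an illustrative sketch: the variable names context and claims_generation are assumptions borrowed from the example workflow later on this page, not required names.

```
System Role:
You are a fact-checking assistant. Verify each claim strictly against the provided context.

User Role:
Context: {{context}}
Claims to verify: {{claims_generation.response}}
For each claim, reply "supported" or "refuted".
```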

Python

  • Function Definition:

    • The function name must match the step name, and the function must be defined exactly once (see the sketch after this list).

    • Only one top-level function definition is allowed. If more than one is found, you will see the error: Error: Only one top-level function definition is allowed. Found {count}. Nested functions are allowed.

  • Inputs can include outputs from previous steps using {{step_name.response}}.

  • Code is securely run in a sandboxed environment.

  • The output can be used in subsequent steps.
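
Below is a minimal sketch of a Python step. It assumes the step is named count_claims and that the platform substitutes {{...}} placeholders into the code before execution; the exact substitution mechanics and the names used here are assumptions for illustration, not a documented API.

```python
# Step name: count_claims — the single top-level function must use the same name.
def count_claims():
    # Assumed: the previous step's output is substituted in as plain text.
    claims_text = """{{claims_generation.response}}"""

    # Nested helper functions are permitted inside the top-level function.
    def non_empty_lines(text):
        return [line.strip() for line in text.splitlines() if line.strip()]

    # The return value is stored and can be referenced by subsequent steps.
    return len(non_empty_lines(claims_text))
```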

Multi-Step Workflow

Execution Order

Steps run in the sequence they appear in the Steps Panel (top to bottom). Each step’s output is stored for reference and possible use by subsequent steps.

Inter-Step Variable Referencing

To reference a step’s output, use:

{{step_name.response}}

Ensure step_name is correct. A missing or misspelled name will lead to an error.

Example Workflow

Metric: Hallucination Check

Step 1: Claims Generation (LLM Call)

Step 2: Claims Verification (LLM Call)

Step 3: Scoring (Python)

Detailed Flow:

Step 1: Prompt the LLM to extract factual claims from a given {{response}}.

Step 2: Prompt the LLM again to verify each claim against a provided {{context}}.

Step 3: Use Python to score the final verification. If any claim is “refuted,” return a score of 1 (indicating a hallucination).
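
A sketch of what the Step 3 scoring function could look like, assuming the verification step is named claims_verification and returns plain text containing one verdict per claim; both the step name and the output format are assumptions for illustration.

```python
# Step name: scoring — returns 1 if any claim was refuted (hallucination), else 0.
def scoring():
    # Assumed: the verification step's output is substituted in as text.
    verdicts = """{{claims_verification.response}}""".lower()

    # Any "refuted" verdict means at least one claim is not supported by the context.
    return 1 if "refuted" in verdicts else 0
```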

Grading Criteria

After all steps are executed, RagaAI Catalyst looks at the final step’s output to determine whether to produce a:

  • Score (0 – 1)

  • Boolean (0/1)

During metric setup, you can select which grading criterion is appropriate. Only the last step's output (not the intermediate steps' outputs) is considered for this grading assessment.


Testing and Verification

Test Eval Button

  • Runs the entire metric workflow from the first step to the last.

  • Provides a Results Viewer, showing output for each step.

Grading Criteria Verification

  • After the test run, click Verify Grading Criteria to validate that the final output meets the chosen criteria.

  • On success/failure, you’ll see a status alert, and the last step’s output is revealed for review.


Version Control and Deployment

Commit Versions

  • As in the Playground, each commit creates a new version of your metric.

  • Keep track of iterations as you refine your steps.

Deploy to Eval

  • Click Deploy to Eval to open a modal.

  • You’ll see the Current Deployed Version and can select another version to deploy. (If none is deployed, it shows None.)

  • Commit Version: You can commit your current configuration as a new version and then deploy it.

  • Unlike the Playground, there’s no “default” version. Instead, you explicitly choose which version to deploy.


Using a Custom Metric in Your Evaluations

Once a metric is committed:

  • It becomes visible in a new tab called Custom Metric when configuring evaluations on your dataset.

  • Map Variables to your dataset columns. For example:

    • prompt → Response

    • context → Context

    • response → User Response Column

Upon running, the system automatically executes the steps in your custom metric and produces a final evaluation.