Custom Metric
This functionality allows you to create multi-step evaluation pipelines, enabling complex workflows that use Large Language Models (LLMs), Python-based scripting, or both. Each metric can be composed of multiple steps, and each step's output is available to subsequent steps. By chaining steps together, you can design flexible and robust evaluation metrics for your unique use cases.
You can find a dedicated Custom Metrics tab in the main navigation menu of the RagaAI Catalyst platform. Here, you can:
View all custom metrics you have created (or have access to).
Create new metrics from scratch.
Edit existing metrics—add, remove, or modify steps.
Manage them through version control and deployments.
Click on Create New Metric.
Provide a Metric Name (required, must be unique).
Add a Description (optional, up to 30 characters).
After naming and describing the metric:
Click Add Step.
Choose the Step Type:
Custom Metric (LLM Call)
Python
You can add as many steps as you need. Each step can reference the outputs of prior steps, giving you virtually unlimited flexibility in how you design your metric.
Custom Metric (LLM Call)
Prompt Editor:
Insert variables using {{variable_name}}.
Configure the System Role and User Role instructions.
Model Configuration:
Select from available LLMs (e.g., GPT-4).
Set parameters like Max Tokens, Temperature, etc.
Any output from previous steps can be referenced with the syntax {{step_name.response}}.
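For example, an LLM Call step might be configured with a prompt like the one below. The step name claims_generation and the variable names are illustrative; replace them with the names used in your own metric.

```
System Role:
You are a fact-checking assistant. Verify each claim strictly against the provided context.

User Role:
Context:
{{context}}

Claims extracted in the previous step:
{{claims_generation.response}}

For each claim, answer "supported" or "refuted" on its own line.
```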
Python
Function Definition:
The function name must match the step name and must be defined exactly once.
Only one top-level function definition is allowed. If more than one is found, you will see an error: Error: Only one top-level function definition is allowed. Found {count}. Nested functions are allowed.
Inputs can include outputs from previous steps using {{step_name.response}}.
Code is securely run in a sandboxed environment.
The output can be used in subsequent steps.
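As a minimal sketch, a Python step named parse_rating could extract a numeric score from an earlier LLM step. The step name, the referenced step quality_check, and the assumption that the {{...}} placeholder is substituted into the code as plain text before execution are all illustrative.

```python
# Step name: parse_rating  (the single top-level function must share this name)
def parse_rating():
    import re

    # Output of a hypothetical earlier LLM step named "quality_check",
    # injected through the {{step_name.response}} syntax.
    llm_output = """{{quality_check.response}}"""

    def first_number(text: str) -> float:
        # Nested helper functions are allowed.
        match = re.search(r"\d+(\.\d+)?", text)
        return float(match.group()) if match else 0.0

    # Normalise an assumed 0-10 rating to the 0-1 range.
    return min(first_number(llm_output) / 10.0, 1.0)
```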
Steps run in the sequence they appear in the Steps Panel (top to bottom). Each step’s output is stored for reference and possible use by subsequent steps.
To reference a previous step's output, use: {{step_name.response}}
Ensure step_name is correct. A missing or misspelled name will lead to an error.
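For example, if a previous step is named claims_generation, its output is available as {{claims_generation.response}}; a misspelled reference such as {{claim_generation.response}} will fail.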
Metric: Hallucination Check
Step 1: Claims Generation (LLM Call)
Step 2: Claims Verification (LLM Call)
Step 3: Scoring (Python)
Step 1: Prompt the LLM to extract factual claims from a given {{response}}.
Step 2: Prompt the LLM again to verify each claim against a provided {{context}}.
Step 3: Use Python to score the final verification. If any claim is “refuted,” return a score of 1 (indicating a hallucination).
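A sketch of what Step 3 could look like, assuming Step 2 is named claims_verification and labels each claim on its own line as "supported" or "refuted" (both the step name and the output format are assumptions):

```python
# Step name: scoring
def scoring():
    # Verdicts from Step 2, injected via {{claims_verification.response}};
    # each non-empty line is assumed to contain "supported" or "refuted".
    verdicts = """{{claims_verification.response}}"""

    def is_refuted(line: str) -> bool:
        return "refuted" in line.lower()

    lines = [line for line in verdicts.splitlines() if line.strip()]
    # Return 1 if any claim is refuted (hallucination detected), else 0.
    return 1 if any(is_refuted(line) for line in lines) else 0
```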
After all steps are executed, RagaAI Catalyst looks at the final step’s output to determine whether to produce a:
Score (0 – 1)
Boolean (0/1)
During metric setup, you can select the grading criterion that fits your metric. Only the final step's output (not the output of intermediate steps) is considered for this grading assessment.
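For instance, to grade the hallucination metric on the Score (0 – 1) scale rather than as a Boolean, the final Python step could return the fraction of refuted claims instead of a 0/1 flag (same assumptions about the claims_verification output format as above):

```python
# Step name: scoring
def scoring():
    verdicts = """{{claims_verification.response}}"""
    lines = [line for line in verdicts.splitlines() if line.strip()]
    refuted = sum(1 for line in lines if "refuted" in line.lower())
    # Fraction of claims refuted, kept within the 0-1 range expected by the grading criterion.
    return refuted / len(lines) if lines else 0.0
```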
Running a test executes the entire metric workflow from the first step to the last.
A Results Viewer shows the output of each step.
After the test run, click Verify Grading Criteria to validate that the final output meets the chosen criteria.
On success or failure, you'll see a status alert, and the last step's output is revealed for review.
As in the Playground, each commit creates a new version of your metric.
Keep track of iterations as you refine your steps.
Click Deploy to Eval to open a modal.
You’ll see the Current Deployed Version and can select another version to deploy. (If none is deployed, it shows None.)
Commit Version: You can commit your current configuration as a new version and then deploy it.
Unlike the Playground, there’s no “default” version. Instead, you explicitly choose which version to deploy.
Once a metric is committed:
It becomes visible in a new Custom Metric tab when you configure evaluations on your dataset.
Map Variables to your dataset columns. For example:
prompt → Response
context → Context
response → User Response Column
Upon running, the system automatically executes the steps in your custom metric and produces a final evaluation.