RagaAI AAEF (Agentic Application Evaluation Framework)

As Agentic AI systems continue to evolve and gain prominence across various industries, the need for robust evaluation methodologies becomes increasingly critical. This whitepaper introduces a comprehensive framework for evaluating Agentic AI applications, drawing insights from recent research and industry best practices. Our proposed framework, the Agentic Application Evaluation Framework (AAEF), provides stakeholders with a structured approach to assess the performance, reliability, and effectiveness of Agentic AI systems.

1. Introduction

1.1 Background

Agentic AI refers to artificial intelligence systems capable of autonomous decision-making and action to achieve specific goals. These systems are characterised by their ability to:

  1. Utilise external tools and APIs (Tool Calling)

  2. Maintain and leverage information from past interactions (Memory)

  3. Formulate and execute strategies to accomplish tasks (Planning)

As these systems grow in complexity and capability, a standardised evaluation framework becomes essential for ensuring that they perform effectively and reliably.

1.2 Importance of Evaluation

Systematic evaluation of Agentic AI workflows is crucial for:

  • Ensuring the effectiveness of autonomous systems

  • Guiding continuous improvement in AI development

  • Facilitating comparison and benchmarking of different Agentic AI systems

  • Informing stakeholders' decisions regarding AI implementation and governance

2. The Agentic Application Evaluation Framework (AAEF)

The AAEF comprises four primary metrics, each designed to evaluate a critical aspect of Agentic AI workflows:

  1. Tool Utilisation Efficacy (TUE)

  2. Memory Coherence and Retrieval (MCR)

  3. Strategic Planning Index (SPI)

  4. Component Synergy Score (CSS)

2.1 Tool Utilisation Efficacy (TUE)

TUE assesses the AI agent's ability to select and use appropriate tools effectively.

TUE = α * (Tool Selection Accuracy) + β * (Tool Usage Efficiency) + γ * (API Call Precision)

Where:

  • Tool Selection Accuracy: The rate at which the AI chooses the most appropriate tool for a given task.

  • Tool Usage Efficiency: A measure of how optimally the AI uses selected tools, considering factors like unnecessary calls and resource usage.

  • API Call Precision: The accuracy and appropriateness of parameters used in API calls.

  • α, β, and γ are weights that can be adjusted based on the specific use case.
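
To make the formula concrete, here is a minimal sketch in Python. The weight values, the assumption that each component score lies between 0 and 1, and the assumption that the weights sum to 1 (so TUE also stays between 0 and 1) are illustrative choices, not part of the framework definition.

# Minimal TUE sketch: a weighted sum of three component scores, each assumed to be in [0, 1].
def tool_utilisation_efficacy(selection_accuracy, usage_efficiency, api_call_precision,
                              alpha=0.4, beta=0.3, gamma=0.3):
    # alpha + beta + gamma is assumed to equal 1 so the result stays in [0, 1].
    return alpha * selection_accuracy + beta * usage_efficiency + gamma * api_call_precision

# Example: 90% correct tool choices, 85% efficient usage, 80% precise API calls.
print(round(tool_utilisation_efficacy(0.90, 0.85, 0.80), 3))  # ≈ 0.855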

2.2 Memory Coherence and Retrieval (MCR)

MCR evaluates the agent's ability to store, retrieve, and utilise information effectively.

MCR = (Context Preservation Score * Information Retention Rate) / (1 + Retrieval Latency)

Where:

  • Context Preservation Score: A measure of how well the AI maintains relevant context across interactions.

  • Information Retention Rate: The proportion of important information retained over time.

  • Retrieval Latency: The time taken to retrieve stored information.
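
A matching sketch for MCR, assuming the preservation score and retention rate are fractions between 0 and 1 and the retrieval latency is a normalised, dimensionless penalty; these conventions are illustrative rather than mandated by the framework.

def memory_coherence_retrieval(context_preservation, retention_rate, retrieval_latency):
    # Latency sits in the denominator, so slower retrieval lowers the score.
    return (context_preservation * retention_rate) / (1.0 + retrieval_latency)

# Example: strong preservation and retention with a modest latency penalty.
print(round(memory_coherence_retrieval(0.85, 0.80, 0.20), 3))  # ≈ 0.567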

2.3 Strategic Planning Index (SPI)

SPI measures the agent's ability to formulate and execute plans effectively.

SPI = (Goal Decomposition Efficiency * Plan Adaptability) * (1 - Plan Execution Error Rate)

Where:

  • Goal Decomposition Efficiency: The AI's ability to break down complex goals into manageable sub-tasks.

  • Plan Adaptability: How well the AI adjusts plans in response to changing circumstances.

  • Plan Execution Error Rate: The frequency of errors or failures in executing planned actions.
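
The same kind of sketch applies to SPI; here the decomposition efficiency, adaptability, and error rate are all assumed to be fractions between 0 and 1.

def strategic_planning_index(decomposition_efficiency, plan_adaptability, execution_error_rate):
    # A higher execution error rate shrinks the score through the (1 - error rate) factor.
    return (decomposition_efficiency * plan_adaptability) * (1.0 - execution_error_rate)

# Example: good decomposition and adaptability, with 20% of plan steps failing.
print(round(strategic_planning_index(0.90, 0.85, 0.20), 3))  # ≈ 0.612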

2.4 Component Synergy Score (CSS)

CSS assesses how well the different components of the Agentic AI system work together.

CSS = (Cross-Component Utilisation Rate * Workflow Cohesion Index) / (1 + Component Conflict Rate)

Where:

  • Cross-Component Utilisation Rate: How often information or outputs from one component are effectively used by another.

  • Workflow Cohesion Index: A measure of the seamless integration among components.

  • Component Conflict Rate: The frequency of conflicts or inconsistencies between different components.
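
Finally, a sketch for CSS, again assuming all three inputs are normalised to a 0 to 1 range.

def component_synergy_score(cross_component_utilisation, workflow_cohesion, component_conflict_rate):
    # Conflicts between components penalise the score via the denominator.
    return (cross_component_utilisation * workflow_cohesion) / (1.0 + component_conflict_rate)

# Example: components share outputs well and conflicts are rare.
print(round(component_synergy_score(0.80, 0.90, 0.05), 3))  # ≈ 0.686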

3. Evaluation Methodology

The AAEF employs a rigorous, multi-faceted approach to evaluate Agentic AI systems. This methodology ensures comprehensive assessment across all critical aspects of AI performance.

3.1 Automated Assessment

Automated metrics form the backbone of the AAEF, providing quantitative, reproducible measurements of AI performance. These metrics are designed to capture specific aspects of each evaluation component (a brief sketch of the calculations appears after the list):

  • Tool Selection Accuracy:

    • Methodology: Implement a logging system that records every tool selection made by the AI.

    • Calculation: (Correct tool selections) / (Total tool selections) over a predefined evaluation period.

    • Validation: Periodically review a subset of selections manually to ensure accuracy.

  • Information Retention Rate:

    • Methodology: Develop a tagging system for important information items introduced during AI interactions.

    • Calculation: (Correctly retained tagged items) / (Total tagged items) measured at set intervals.

    • Validation: Use controlled test scenarios with known information to verify retention accuracy.

  • Plan Execution Error Rate:

    • Methodology: Implement detailed logging of all plan steps and their outcomes.

    • Calculation: (Failed or erroneous steps) / (Total plan steps) across multiple task executions.

    • Validation: Conduct simulations with predefined failure points to calibrate error detection.
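
The sketch below derives the three rates above from a hypothetical event log; the field names and log structure are assumptions made for illustration, not part of the AAEF specification.

# Hypothetical log entries; the field names are illustrative only.
tool_events = [
    {"selected": "sentiment_api", "expected": "sentiment_api"},
    {"selected": "keyword_search", "expected": "sentiment_api"},
]
tagged_items = [
    {"id": "order_id", "retained": True},
    {"id": "user_name", "retained": True},
    {"id": "delivery_date", "retained": False},
]
plan_steps = [{"status": "ok"}, {"status": "ok"}, {"status": "error"}, {"status": "ok"}]

tool_selection_accuracy = sum(e["selected"] == e["expected"] for e in tool_events) / len(tool_events)
information_retention_rate = sum(item["retained"] for item in tagged_items) / len(tagged_items)
plan_execution_error_rate = sum(s["status"] != "ok" for s in plan_steps) / len(plan_steps)

print(tool_selection_accuracy, information_retention_rate, plan_execution_error_rate)
# 0.5 0.6666666666666666 0.25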

3.2 LLM-Assisted Evaluation

For nuanced assessments that require contextual understanding, we employ Large Language Models (LLMs) as impartial evaluators. This approach is particularly effective for components that resist simple quantification (a sketch of the evaluation loop appears after the list):

  • Prompt Engineering:

    • Develop a standardised set of prompts for each evaluation component.

    • Ensure prompts are clear, unambiguous, and designed to elicit specific, relevant information.

    • Example: "Analyse the following AI-generated plan: [plan]. On a scale of 0 to 1, rate its adaptability to changing circumstances. Provide a brief justification for your rating."

  • Context Provision:

    • Supply LLMs with comprehensive context, including:

      • Detailed description of the task and its objectives

      • Relevant background information on the AI system being evaluated

      • Specific criteria for evaluation

    • Ensure consistency in context provision across evaluations to maintain comparability.

  • Response Interpretation:

    • Develop a rubric for interpreting LLM responses on a standardised scale (0-1).

    • Implement multiple LLM evaluations for each component to mitigate individual biases.

    • Use statistical methods (e.g., inter-rater reliability measures) to assess consistency across LLM evaluations.
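
The sketch below shows one way such an LLM-assisted rating could be wired up. The call_llm function is a placeholder (here it returns a canned reply) standing in for whichever model endpoint is used; the prompt mirrors the example above, and averaging several ratings is a simple stand-in for the multi-evaluator, inter-rater step.

import re
import statistics

PROMPT_TEMPLATE = (
    "Analyse the following AI-generated plan: {plan}. "
    "On a scale of 0 to 1, rate its adaptability to changing circumstances. "
    "Provide a brief justification for your rating."
)

def call_llm(prompt):
    # Placeholder judge: replace with a real model call (OpenAI, Gemini, Bedrock, etc.).
    return "Rating: 0.8. The plan includes contingency branches for road closures."

def extract_rating(response):
    # Pull the first number between 0 and 1 out of the judge's reply.
    match = re.search(r"\b(0(?:\.\d+)?|1(?:\.0+)?)\b", response)
    if match is None:
        raise ValueError("No 0-1 rating found in LLM response")
    return float(match.group(1))

def rate_plan_adaptability(plan, n_evaluations=3):
    # Average several independent ratings to mitigate individual judge bias.
    ratings = [extract_rating(call_llm(PROMPT_TEMPLATE.format(plan=plan)))
               for _ in range(n_evaluations)]
    return statistics.mean(ratings)

print(rate_plan_adaptability("Deliver via Route A; fall back to Route B if blocked."))  # 0.8 with the canned reply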

3.3 Hybrid Approach Implementation

The AAEF leverages a hybrid approach, combining automated metrics and LLM-assisted evaluation (a sketch of the scoring and trigger steps appears after the list):

  • Data Collection:

    • Implement comprehensive logging systems to capture all relevant AI actions and outputs.

    • Develop APIs for real-time data extraction from AI systems during operation.

  • Automated Analysis:

    • Develop data processing pipelines to calculate automated metrics continuously.

    • Implement anomaly detection algorithms to flag unusual patterns or performance drops.

  • LLM Evaluation Triggers:

    • Set up conditional triggers for LLM evaluations based on specific thresholds or events in the automated metrics.

    • Schedule regular LLM evaluations to provide ongoing qualitative assessments.

  • Integration and Reporting:

    • Develop a weighted scoring system that combines automated and LLM-assisted evaluations.

    • Implement a dashboard for real-time visualisation of AAEF metrics and trends.

    • Generate comprehensive reports at regular intervals, highlighting key performance indicators and areas for improvement.

  • Continuous Refinement:

    • Establish a feedback loop where evaluation results inform the refinement of both the AI system and the AAEF itself.

    • Regularly review and update evaluation methodologies based on new research and industry developments.
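
As a sketch of the integration and trigger steps, the snippet below blends an automated score with an LLM-assisted score and flags a qualitative re-evaluation when the automated score dips below a threshold. The weights, the threshold, and the function names are assumptions chosen for illustration, not values fixed by the framework.

def combined_score(automated, llm_assisted, w_auto=0.6, w_llm=0.4):
    # Weighted blend of quantitative and qualitative assessments (weights are illustrative).
    return w_auto * automated + w_llm * llm_assisted

def needs_llm_review(automated, threshold=0.7):
    # Example trigger: request an extra LLM evaluation when automated metrics dip.
    return automated < threshold

print(round(combined_score(automated=0.82, llm_assisted=0.74), 3))  # 0.788
print(needs_llm_review(0.82))  # False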

By employing this robust, multi-faceted methodology, the AAEF provides a comprehensive and nuanced evaluation of Agentic AI systems, offering valuable insights for researchers, developers, and stakeholders in the field.

4. Case Studies

4.1 Data Analysis AI Assistant

In this case study, we evaluated a Data Analysis AI Assistant tasked with performing sentiment analysis on a large dataset of customer reviews.

Key findings:

  • TUE Score: 0.929

  • Strong performance in tool selection (VADER sentiment analysis tool)

  • High efficiency in tool usage and API call precision

  • Minor room for improvement in preprocessing and result aggregation

4.2 Autonomous Delivery Robot

We assessed an autonomous delivery robot responsible for navigating city streets to deliver packages from a warehouse to customers' homes.

Key findings:

  • SPI Score: 0.73

  • Efficient goal decomposition into manageable sub-tasks

  • Strong adaptability to unexpected circumstances (e.g., road closures)

  • Low but non-zero error rate in plan execution, indicating room for improvement

4.3 Customer Service Chatbot

A customer service chatbot was evaluated on its ability to maintain context and retrieve relevant information over time.

Key findings:

  • MCR Score: 0.48

  • Moderate performance in context preservation and information retention

  • Quick information retrieval

  • Significant room for improvement in preserving all crucial details

5. Conclusion

The Agentic Application Evaluation Framework (AAEF) provides a structured and comprehensive approach to assessing the performance of Agentic AI systems. By evaluating tool utilisation, memory management, strategic planning, and component integration, AAEF enables developers, researchers, and stakeholders to identify strengths and areas for improvement in their Agentic AI applications.

As the field of Agentic AI continues to evolve, frameworks like AAEF will play a crucial role in ensuring the development of effective and efficient AI systems. We encourage the AI community to adopt, refine, and expand upon this framework to drive the continued advancement of Agentic AI technology.
