Failure Mode Analysis

Failure Mode Analysis lets you define rules on metrics by setting thresholds. Based on these rules, you get a sorted list of all clusters that breach the threshold and all clusters that stay within it. RagaAI creates the clusters from the embeddings of the dataset, and clusters that underperform and breach the threshold are highlighted on the Issue Stats screen.

Execute Test

The following code snippet performs a Failure Mode Analysis test on Large Language Models (LLMs) within the RagaAI environment.

First, define the rules used to evaluate the LLM's performance against linguistic metrics such as BLEU, Cosine Similarity, METEOR, and ROUGE. These rules assess the model's accuracy and its ability to generate contextually relevant, coherent text.

# Assumes the RagaAI test SDK has been imported and a test_session created.
rules = FMA_LLMRules()
rules.add(metric='accuracy', metric_threshold=0.5, eval_metric='BLEU', threshold=0.1)
rules.add(metric='accuracy', metric_threshold=0.5, eval_metric='CosineSimilarity', threshold=0.5)
rules.add(metric='accuracy', metric_threshold=0.7, eval_metric='METEOR', threshold=0.2)
rules.add(metric='accuracy', metric_threshold=0.5, eval_metric='ROUGE', threshold=0.25)

cls_default = clustering(test_session=test_session,
                         dataset_name="llm_dataset_testing",
                         method="k-means",
                         embedding_col="summary_vector",
                         level="image",
                         args={"numOfClusters": 5})

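Conceptually, the clustering call above groups dataset rows by the similarity of their embedding vectors; its parameters are explained below. As a rough standalone illustration of what k-means over an embedding column does — not the SDK's internal implementation, and with random data standing in for real summary embeddings:

# Illustrative sketch only, not the SDK's implementation: k-means over a
# matrix of embedding vectors, mirroring method="k-means" and
# args={"numOfClusters": 5} above. Random data stands in for real embeddings.
import numpy as np
from sklearn.cluster import KMeans

embeddings = np.random.rand(200, 384)          # one 384-d vector per dataset row
kmeans = KMeans(n_clusters=5, random_state=0)  # five clusters, as in the SDK call
cluster_ids = kmeans.fit_predict(embeddings)   # cluster label for each row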

edge_case_detection = failure_mode_analysis_llm(test_session=test_session,
                                                dataset_name="llm_dataset_testing",
                                                test_name="fma_llm_1",
                                                model="modelA",
                                                gt="GT",
                                                rules=rules,
                                                type="fma",
                                                output_type="llm",
                                                prompt_col_name="document",
                                                model_column="summary",
                                                gt_column="reference_summary",
                                                embedding_col_name="document_vector",
                                                model_embedding_column="summary_vector",
                                                gt_embedding_column="reference_summary_vector",
                                                clustering=cls_default)

test_session.add(edge_case_detection)
test_session.run()
  • FMA_LLMRules(): Initialises the rules for the Failure Mode Analysis (FMA) test specifically designed for Language Models (LLMs).

  • rules.add(): Adds a new rule with specific parameters for evaluating the LLM's performance.

    • metric: The performance metric to evaluate (e.g., 'accuracy'). This refers to a general performance measure of the LLM.

    • metric_threshold: The minimum acceptable value for the metric. If the model's performance falls below this threshold, it may indicate a failure mode.

    • eval_metric: The specific evaluation metric used to assess a particular aspect of the LLM's output (e.g., 'BLEU', 'CosineSimilarity', 'METEOR', 'ROUGE').

    • threshold: The minimum acceptable value for the eval_metric. This threshold helps identify when the model's performance on this specific metric is inadequate (see the sketch after this list for how the two thresholds interact).

  • clustering(): Sets the clustering method for grouping similar failure modes.

    • test_session: Specifies the test session to which this clustering configuration applies.

    • dataset_name: The name of the dataset on which the FMA is to be performed.

    • method: The clustering technique used, in this case, "k-means". It's a method to group data points (in this context, failure modes) into a specified number of clusters.

    • embedding_col: The column containing embedding vectors to be used for clustering. In this case, 'summary_vector' suggests that embeddings of the model's summaries are used.

    • level: The level at which clustering is applied. The value "image" is likely a placeholder or context-specific term and should correspond to the granularity of the data being clustered.

    • args: Additional arguments for the clustering method, such as the number of clusters ("numOfClusters": 5).

  • failure_mode_analysis_llm(): This function sets up the FMA specific to Language Models (LLMs) within the test environment.

    • test_session: Specifies the test session in which the FMA is conducted. It's the environment where the analysis is executed.

    • dataset_name: The name of the dataset to be used for the analysis.

    • test_name: A unique identifier for this particular test run, in this case, "fma_llm_1".

    • model: The identifier of the Language Model being tested, here referred to as "modelA".

    • gt: Short for "ground truth", this parameter refers to a standard or benchmark against which the model’s output is compared. It could be a dataset or a model providing expected results.

    • rules: The set of rules defined earlier for evaluating the model's performance. These rules determine how the model's output is assessed against various linguistic metrics.

    • type: Specifies the type of analysis to be performed. In this context, "fma" indicates that it's a Failure Mode Analysis.

    • output_type: This parameter likely indicates the type of output the model generates, which in this case is "llm", referring to Language Model outputs.

    • prompt_col_name: The name of the column in the dataset that contains the prompts or inputs given to the model. Here, it's named "document".

    • model_column: The column in the dataset that contains the outputs generated by the model, in this case, "summary".

    • gt_column: The column containing the ground truth data or reference summaries against which the model's output is compared, named "reference_summary".

    • embedding_col_name: The name of the column that contains the vector representations (embeddings) of the documents or prompts.

    • model_embedding_column: This column contains the vector representations of the model's summaries. It is used for analyses that require understanding the semantic space of the model's outputs.

    • gt_embedding_column: Contains the vector representations of the ground truth or reference summaries.

    • clustering: References the clustering configuration set earlier. This is used to group similar types of failures or performance issues identified during the FMA.

  • test_session.add(): Registers the failure mode analysis test with the session.

  • test_session.run(): Starts the execution of all tests in the session, including your failure mode analysis test.
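To make the interplay of metric_threshold and threshold concrete, here is a hypothetical sketch of one plausible reading of the rule semantics, in plain Python for intuition only (the SDK's internal logic may differ): a sample passes a rule when its eval_metric score meets threshold, a cluster's accuracy is its fraction of passing samples, and the cluster breaches the rule when that accuracy falls below metric_threshold.

# Hypothetical sketch of the rule semantics described above; for intuition
# only, not the SDK's actual implementation.
def cluster_breaches_rule(scores, metric_threshold, threshold):
    """scores: per-sample eval_metric values (e.g. BLEU) for one cluster."""
    passing = [s for s in scores if s >= threshold]  # samples clearing the eval_metric bar
    accuracy = len(passing) / len(scores)            # cluster-level 'accuracy'
    return accuracy < metric_threshold               # breach: accuracy below the bar

# Against rules.add(metric='accuracy', metric_threshold=0.5, eval_metric='BLEU', threshold=0.1):
print(cluster_breaches_rule([0.05, 0.08, 0.09, 0.2], 0.5, 0.1))  # True: only 1 of 4 samples passes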

This setup in RagaAI helps you effectively analyse and identify failure modes in LLMs, providing valuable insights for model improvement and refinement.

Analysing Test Results

Clustering (Embedding View)

  1. Navigate to Embedding View: Find the interactive visualisation interface in RagaAI.

  2. Visualise Clusters: Use the tool to visualise the clusters, uncovering hidden patterns and structures in the data.

  3. Select Data Points: Use the lasso tool to select specific data points of interest and analyse their results.

Visualising Data

  1. Grid View: Access the grid view to see data points within the selected clusters.

  2. Data Filtering: Use this feature to focus on specific subsets of your dataset that meet certain conditions, helping to extract meaningful patterns and trends.

Understanding Clustering and Threshold Breaches

  • Cluster Analysis: RagaAI creates clusters using the embeddings of the dataset. These clusters group similar data points together.

  • Identifying Underperforming Clusters: Clusters that underperform (breaching the threshold on accuracy) will be highlighted on the Issue Stats screen.

Navigating and Interpreting Results

  • Directly Look at Problematic Clusters: Users can quickly identify clusters responsible for underperformance and assess their impact on the overall model.

  • In-Depth Analysis: Dive deeper into specific clusters or data points to understand the root causes of underperformance.

Data Analysis

  1. Switch to Analysis Tab: To get a detailed performance report, go to the Analysis tab.

  2. View Performance Metrics: Examine performance metrics across distribution, document-size, and temporal graphs.

Practical Tips

  • Set Realistic Thresholds: Choose thresholds that reflect the expected performance of your model; one way to express stricter and more lenient rule sets is sketched below.
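A hedged sketch using only the FMA_LLMRules API shown above; the exact values are illustrative, not recommendations:

# Illustrative only: two rule sets of different strictness, built with the
# same FMA_LLMRules API shown above. Tune the values to your model's baseline.
strict_rules = FMA_LLMRules()
strict_rules.add(metric='accuracy', metric_threshold=0.8, eval_metric='BLEU', threshold=0.3)

lenient_rules = FMA_LLMRules()
lenient_rules.add(metric='accuracy', metric_threshold=0.4, eval_metric='BLEU', threshold=0.05)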

By following these steps, users can efficiently leverage the Failure Mode Analysis test to gain a comprehensive understanding of their model's performance, identify key areas for improvement, and make data-driven decisions to enhance model accuracy and reliability.
