Synthetic Data Generation

Use LLMs to generate numerous synthetic prompts; currently supported via SDK only.

Exclusive to enterprise customers. Contact us to activate this feature.

RagaAI's Synthetic Data Generation feature streamlines building and evaluating applications based on large language models (LLMs). It generates use-case-specific golden datasets tailored to your application from a context document that you provide.

The system can generate synthetic data for various applications, such as chatbot development, customer service automation, document summarisation, or code generation.

Models Supported:

Groq
Gemini
OpenAI

Documents Supported:

PDF
Text
Markdown
CSV

Question Types Supported:

Simple
MCQ
Complex

Steps to generate a synthetic dataset:

Inside a project, select the "generate synthetic data" option.

Enter a unique dataset name, upload the relevant context documents, choose the question types, and select the LLM (make sure the context stays within the model's token limit). Then specify the desired number of rows and generate the dataset.

The generated dataset will appear under the "Dataset" tab with the assigned name.
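The token-limit requirement in the steps above can be roughly checked before uploading a document. The sketch below is only an illustration: the helper names are hypothetical, the 4-characters-per-token ratio is a crude heuristic for English text rather than a real tokenizer, and the limit and reserve values are placeholders that you should replace with the figures for your chosen model.

```python
import math

# ASSUMPTION: about 4 characters per token. Real tokenizers differ per model,
# so treat the result as an estimate, not an exact count.
CHARS_PER_TOKEN = 4.0


def estimate_tokens(text: str, chars_per_token: float = CHARS_PER_TOKEN) -> int:
    """Return a rough token estimate for `text` (heuristic only)."""
    return math.ceil(len(text) / chars_per_token)


def fits_context(text: str, token_limit: int, reserve: int = 1024) -> bool:
    """Check whether the estimate plus a reserve (for instructions and
    generated output) stays within `token_limit`."""
    return estimate_tokens(text) + reserve <= token_limit


def truncate_to_limit(text: str, token_limit: int, reserve: int = 1024) -> str:
    """Cut `text` so that it passes `fits_context` under the same heuristic."""
    max_chars = max(0, int((token_limit - reserve) * CHARS_PER_TOKEN))
    return text[:max_chars]


if __name__ == "__main__":
    context = "example context " * 1000  # stand-in for a real document
    limit = 2000                         # placeholder, not a real model limit
    if not fits_context(context, limit):
        context = truncate_to_limit(context, limit)
    print(estimate_tokens(context), fits_context(context, limit))
```

Truncating is the simplest fallback; splitting a long document into several smaller context files may preserve more of the content.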

Steps to generate data using the SDK:

from ragaai_catalyst import SyntheticDataGeneration

synthetic_data_generation = SyntheticDataGeneration()

# Provide the path to your context file
text_file = "your-context-file-path"
text = synthetic_data_generation.process_document(input_data=text_file)

# Model settings shared by all three calls
model_config = {"provider": "gemini", "model": "gemini-1.5-flash", "api_base": "your-api-base"}

# Simple questions
result1 = synthetic_data_generation.generate_qna(text, question_type="simple", model_config=model_config, n=20)
# Complex questions
result2 = synthetic_data_generation.generate_qna(text, question_type="complex", model_config=model_config, n=20)
# MCQ questions
result3 = synthetic_data_generation.generate_qna(text, question_type="mcq", model_config=model_config, n=20)

# Preview the first rows of the simple-question results
print(result1.head())
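Because the three calls differ only in the question type, the arguments can be assembled and validated in one place before calling the SDK. The following is a minimal sketch, not part of the ragaai_catalyst API: build_qna_kwargs is a hypothetical helper, and the lowercase provider strings "groq" and "openai" are assumptions inferred from the supported-models list (only "gemini" appears in the example above).

```python
# ASSUMPTION: provider identifiers are lowercase; only "gemini" is confirmed
# by the SDK example in this page.
SUPPORTED_PROVIDERS = ("groq", "gemini", "openai")
SUPPORTED_QUESTION_TYPES = ("simple", "complex", "mcq")


def build_qna_kwargs(question_type, provider, model, n, api_base=None):
    """Validate the inputs and return keyword arguments for generate_qna."""
    question_type = question_type.lower()
    provider = provider.lower()
    if question_type not in SUPPORTED_QUESTION_TYPES:
        raise ValueError(f"Unsupported question type: {question_type!r}")
    if provider not in SUPPORTED_PROVIDERS:
        raise ValueError(f"Unsupported provider: {provider!r}")
    if not isinstance(n, int) or n <= 0:
        raise ValueError("n must be a positive integer")

    model_config = {"provider": provider, "model": model}
    if api_base is not None:
        model_config["api_base"] = api_base
    return {"question_type": question_type, "model_config": model_config, "n": n}


if __name__ == "__main__":
    for qtype in SUPPORTED_QUESTION_TYPES:
        kwargs = build_qna_kwargs(qtype, "gemini", "gemini-1.5-flash", 20,
                                  api_base="your-api-base")
        # With the SDK installed, the call would look like:
        # result = synthetic_data_generation.generate_qna(text, **kwargs)
        print(kwargs)
```

Validating up front surfaces a mistyped question type or provider before any LLM request is made.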

This feature reduces the manual effort required to create and test datasets, speeds up the development and evaluation cycle for LLMs, and keeps the datasets aligned with the user’s goals.
