Synthetic Data Generation (Beta)

Use LLMs to generate large numbers of synthetic prompts. Currently available through the SDK only.

Exclusive to enterprise customers. Contact us to activate this feature.

RagaAI's Synthetic Data Generation feature streamlines the process of building and evaluating large language models (LLMs). Given a context document, it generates use-case-specific golden datasets tailored to your application.

The system can generate synthetic data for various applications, such as chatbot development, customer service automation, document summarisation, or code generation.

Models Supported:

  • Groq

  • Gemini

  • OpenAI

Documents Supported:

  • PDF

  • Text

  • Markdown

  • CSV

Question Types Supported:

  • Simple

  • MCQ

  • Complex
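The supported providers, document types, and question types listed above can be checked locally before any SDK call is made. The following is a minimal sketch, not part of the RagaAI SDK: the helper `check_request`, the lowercase provider and question-type identifiers, and the file extensions mapped to each document type are all assumptions made for illustration.

```python
from pathlib import Path

# Values taken from the lists above. The lowercase spellings and the file
# extensions are assumptions for this sketch, not confirmed SDK constants.
SUPPORTED_PROVIDERS = {"groq", "gemini", "openai"}
SUPPORTED_EXTENSIONS = {".pdf", ".txt", ".md", ".csv"}
SUPPORTED_QUESTION_TYPES = {"simple", "mcq", "complex"}


def check_request(file_path: str, provider: str, question_type: str) -> list[str]:
    """Hypothetical helper: return a list of problems, empty if none were found."""
    problems = []
    if Path(file_path).suffix.lower() not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported document type: {file_path}")
    if provider.lower() not in SUPPORTED_PROVIDERS:
        problems.append(f"unsupported provider: {provider}")
    if question_type.lower() not in SUPPORTED_QUESTION_TYPES:
        problems.append(f"unsupported question type: {question_type}")
    return problems


if __name__ == "__main__":
    # An empty list means the request matches the supported options above.
    print(check_request("context.pdf", "gemini", "mcq"))
```

Running a check like this first can catch a mistyped provider or an unsupported file type before any model API call is made.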

from ragaai_catalyst import SyntheticDataGeneration

synthetic_data_generation = SyntheticDataGeneration()

# Provide the path to your context file
text_file = "your-context-file-path"
text = synthetic_data_generation.process_document(input_data=text_file)

# Model settings shared by all three calls below
model_config = {
    "provider": "gemini",
    "model": "gemini-1.5-flash",
    "api_base": "your-api-base",
}

# For simple questions
result1 = synthetic_data_generation.generate_qna(text, question_type="simple", model_config=model_config, n=20)
# For complex questions
result2 = synthetic_data_generation.generate_qna(text, question_type="complex", model_config=model_config, n=20)
# For MCQ questions
result3 = synthetic_data_generation.generate_qna(text, question_type="mcq", model_config=model_config, n=20)

# Preview the first few generated rows
print(result1.head())
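The three calls above differ only in `question_type`, so they can be driven by a loop. The sketch below assumes only the call shape shown in the example (`text`, `question_type`, `model_config`, `n`). The wrapper `generate_all_types` is a hypothetical helper and not part of the SDK. It takes the generator as a parameter so it can be tried with a stand-in function.

```python
from typing import Any, Callable

# Question types in the order used in the example above.
QUESTION_TYPES = ("simple", "complex", "mcq")


def generate_all_types(
    generate_fn: Callable[..., Any],
    text: str,
    model_config: dict,
    n: int = 20,
) -> dict[str, Any]:
    """Hypothetical helper: call generate_fn once per question type.

    Returns a dict mapping each question type to whatever generate_fn returned.
    """
    return {
        qtype: generate_fn(text, question_type=qtype, model_config=model_config, n=n)
        for qtype in QUESTION_TYPES
    }


if __name__ == "__main__":
    # Stand-in generator so the sketch runs without the SDK or an API key.
    def fake_generate(text, question_type, model_config, n):
        return f"{n} {question_type} questions"

    print(generate_all_types(fake_generate, "context", {"provider": "gemini"}, n=5))
```

With the SDK, `generate_fn` would be `synthetic_data_generation.generate_qna`, under the assumption that it accepts these keyword arguments as in the example above.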

This feature reduces the manual effort required to create and test datasets, speeds up the development and evaluation cycle for LLMs, and keeps the datasets aligned with the user's goals.
