# Synthetic Data Generation

{% hint style="info" %}
Exclusive to enterprise customers. [Contact us](https://calendly.com/nirmalya-raga/30min?month=2025-09) to activate this feature.
{% endhint %}

RagaAI's **Synthetic Data Generation** feature streamlines building and evaluating large language model (LLM) applications. Given a context document, it generates use-case-specific *golden datasets* tailored to your application.

<figure><img src="/files/FSh6PKhhXRJPmuvCNGyH" alt=""><figcaption></figcaption></figure>

The system can generate synthetic data for various applications, such as chatbot development, customer service automation, document summarisation, or code generation.

#### Models Supported:

* Groq
* Gemini
* OpenAI

#### Documents Supported:

* PDF
* Text
* Markdown
* CSV

#### Question Types Supported:

* Simple
* MCQ
* Complex
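
The lists above can be turned into a small pre-flight check that runs before a generation request is submitted. The following is a minimal sketch, not part of the RagaAI SDK: `check_request` is a hypothetical helper, the file extensions are an assumed mapping of the listed document types, and the lowercase provider names `groq` and `openai` are assumed by analogy with the `gemini` identifier used in the SDK example on this page.

```python
from pathlib import Path

# Assumed mappings: the page names the formats and providers,
# but does not state the exact extensions or identifiers accepted.
SUPPORTED_EXTENSIONS = {".pdf", ".txt", ".md", ".csv"}
SUPPORTED_PROVIDERS = {"groq", "gemini", "openai"}
SUPPORTED_QUESTION_TYPES = {"simple", "mcq", "complex"}


def check_request(file_path: str, provider: str, question_type: str) -> list[str]:
    """Return a list of problems; an empty list means the request looks valid."""
    problems = []
    extension = Path(file_path).suffix.lower()
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported document type: {extension or '(none)'}")
    if provider.lower() not in SUPPORTED_PROVIDERS:
        problems.append(f"unsupported provider: {provider}")
    if question_type.lower() not in SUPPORTED_QUESTION_TYPES:
        problems.append(f"unsupported question type: {question_type}")
    return problems


print(check_request("handbook.pdf", "gemini", "mcq"))
print(check_request("slides.pptx", "gemini", "simple"))
```

Returning a list of problems rather than raising on the first one lets a caller report every issue with a request at once.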

### Steps to generate a Synthetic Dataset:

1. Inside a Project, select the "generate synthetic data" option.

<figure><img src="/files/xi7BZpAUsEqZQopE0zYB" alt=""><figcaption></figcaption></figure>

2. Enter a unique dataset name, upload the relevant context documents, configure the question types, and select the LLM. Make sure the context stays within the selected model's token limit. Then specify the desired number of rows and generate the dataset.

<figure><img src="/files/uIfdM8Shfjy3nNPvdAsB" alt=""><figcaption></figcaption></figure>

3. The generated dataset will appear under the "Dataset" tab with the assigned name.

<figure><img src="/files/y0tg1dKWCsQ9kWhWRbZ6" alt=""><figcaption></figcaption></figure>
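
Step 2 asks you to keep the context within the model's token limit. A rough way to check this before uploading is sketched below. It assumes the common rule of thumb of about four characters per token for English text, which is only an approximation and not a real tokenizer. `estimate_tokens` and `fits_in_context` are hypothetical helper names, and the limit you pass in should come from your model provider's documentation.

```python
import math

# Heuristic only: roughly 4 characters per token for English text.
# Use the provider's own tokenizer when an exact count matters.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Approximate the token count of a text from its character length."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_in_context(text: str, context_limit: int, reserved_for_output: int = 0) -> bool:
    """Check whether the text, plus room reserved for the answer, fits the limit."""
    return estimate_tokens(text) + reserved_for_output <= context_limit


sample = "a" * 4000  # stand-in for the contents of a context document
print(estimate_tokens(sample))
print(fits_in_context(sample, context_limit=2000, reserved_for_output=500))
```

Reserving part of the budget for the model's output leaves room for the generated questions and answers, not just the input document.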

### Steps to generate data using the SDK:

```python
from ragaai_catalyst import SyntheticDataGeneration

synthetic_data_generation = SyntheticDataGeneration()

# Provide your context file
text_file = "your-context-file-path"
text = synthetic_data_generation.process_document(input_data=text_file)

# Model settings shared by all three calls
model_config = {
    "provider": "gemini",
    "model": "gemini-1.5-flash",
    "api_base": "your-api-base",
}

# For simple questions
result1 = synthetic_data_generation.generate_qna(
    text, question_type="simple", model_config=model_config, n=20
)
# For complex questions
result2 = synthetic_data_generation.generate_qna(
    text, question_type="complex", model_config=model_config, n=20
)
# For MCQ questions
result3 = synthetic_data_generation.generate_qna(
    text, question_type="mcq", model_config=model_config, n=20
)

# Preview the first rows of the simple-question results
print(result1.head())
```
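
The three calls above differ only in `question_type`, so they can be collapsed into a loop. The sketch below shows one way to do that. `generate_all_types` is a hypothetical helper, not an SDK function. It takes the generation function as an argument and uses the same call shape as the example above. `fake_generate_qna` is a stand-in so the sketch runs without the SDK or credentials.

```python
from typing import Any, Callable, Dict, Sequence


def generate_all_types(
    generate_qna: Callable[..., Any],
    text: str,
    model_config: Dict[str, str],
    n: int = 20,
    question_types: Sequence[str] = ("simple", "complex", "mcq"),
) -> Dict[str, Any]:
    """Call generate_qna once per question type and key the results by type."""
    return {
        question_type: generate_qna(
            text, question_type=question_type, model_config=model_config, n=n
        )
        for question_type in question_types
    }


# Stand-in for the SDK method; it only echoes its arguments.
def fake_generate_qna(text, question_type, model_config, n):
    return {"question_type": question_type, "rows": n, "provider": model_config["provider"]}


results = generate_all_types(
    fake_generate_qna,
    "example context",
    {"provider": "gemini", "model": "gemini-1.5-flash", "api_base": "your-api-base"},
    n=5,
)
print(sorted(results))
```

With the real SDK, you would pass `synthetic_data_generation.generate_qna` in place of `fake_generate_qna`, along with the `text` returned by `process_document`.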

This feature reduces the manual effort required to create and test datasets, speeds up the development and evaluation cycle for LLMs, and keeps the datasets aligned with the user's goals.

