# Synthetic Data Generation

{% hint style="info" %}
Exclusive to enterprise customers. [Contact us](https://calendly.com/nirmalya-raga/30min?month=2025-09) to activate this feature.
{% endhint %}

RagaAI's **Synthetic Data Generation** feature streamlines building and evaluating large language model (LLM) applications. Given a context document, it generates use-case-specific *golden datasets* tailored to your application.

<figure><img src="https://1811327582-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FYbIiNdp1QbG4avl7VShw%2Fuploads%2FhLNznTEAnwhptnPKA6xA%2FScreenshot%202024-10-07%20at%208.36.01%E2%80%AFAM.png?alt=media&#x26;token=f29c6e6c-e6a1-45ff-b1b6-11bc316fab17" alt=""><figcaption></figcaption></figure>

The system can generate synthetic data for various applications, such as chatbot development, customer service automation, document summarisation, or code generation.

#### Models Supported:

* Groq
* Gemini
* OpenAI

#### Documents Supported:

* PDF
* Text
* Markdown
* CSV
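
Before uploading, it can help to check that a context file is one of the supported types. The sketch below is illustrative only: `is_supported_document` is a hypothetical helper, not part of the SDK, and the extension set is an assumption, since the list above names formats rather than file extensions.

```python
from pathlib import Path

# Assumed extensions for the formats listed above (PDF, Text, Markdown, CSV).
# The platform may accept other extensions; adjust this set as needed.
SUPPORTED_EXTENSIONS = {".pdf", ".txt", ".md", ".csv"}


def is_supported_document(path: str) -> bool:
    """Return True if the file extension is in the assumed supported set."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


if __name__ == "__main__":
    # Hypothetical file names, used only to exercise the check.
    for name in ["handbook.pdf", "notes.MD", "report.docx"]:
        print(name, is_supported_document(name))
```

The check is case-insensitive and looks only at the extension, so it does not verify the file's actual contents.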

#### Question Types Supported:

* Simple
* MCQ
* Complex

### Steps to generate a Synthetic Dataset:

1. Inside a Project, select the "generate synthetic data" option.

<figure><img src="https://1811327582-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FYbIiNdp1QbG4avl7VShw%2Fuploads%2FjXeYtxuomQH1e8w6mE3A%2FScreenshot%202024-11-20%20at%209.53.17%E2%80%AFPM.png?alt=media&#x26;token=e10a2360-3019-43c6-a91d-65a60e8fa466" alt=""><figcaption></figcaption></figure>

2. Enter a unique dataset name, upload the relevant context documents, choose the question types, and select the LLM (make sure the context fits within the model's token limit). Then specify the number of rows and generate the dataset.

<figure><img src="https://1811327582-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FYbIiNdp1QbG4avl7VShw%2Fuploads%2Fzge09NhFYXehBK5dQQXu%2FScreenshot%202024-11-20%20at%2010.07.55%E2%80%AFPM.png?alt=media&#x26;token=79dd60b5-2f2a-43ff-9f05-be14f65f91f8" alt=""><figcaption></figcaption></figure>

3. The generated dataset will appear under the "Dataset" tab with the assigned name.

<figure><img src="https://1811327582-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FYbIiNdp1QbG4avl7VShw%2Fuploads%2F2mfp2fQhvKimlhTesd40%2FScreenshot%202024-11-20%20at%2010.15.38%E2%80%AFPM.png?alt=media&#x26;token=0ba53c36-89a4-4861-84bc-04e4176bbf73" alt=""><figcaption></figcaption></figure>
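
Step 2 asks that the context stay within the selected model's token limit. A rough pre-check can be sketched as follows. The four-characters-per-token ratio is a common rule of thumb for English text, not an exact count, and `estimate_tokens`, `fits_context`, and the `token_limit` value are hypothetical names introduced for illustration. The real limit should come from your model provider's documentation.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Roughly estimate the token count from the character count.

    The default ratio is a heuristic; real tokenizers vary by model and language.
    """
    return math.ceil(len(text) / chars_per_token)


def fits_context(text: str, token_limit: int, reserve: int = 0) -> bool:
    """Return True if the estimated tokens plus a reserve fit within token_limit.

    `reserve` leaves headroom for the prompt and the generated output.
    """
    return estimate_tokens(text) + reserve <= token_limit


if __name__ == "__main__":
    sample = "Synthetic data generation needs a context document. " * 100
    # 8000 is a placeholder limit, not the limit of any specific model.
    print(estimate_tokens(sample), fits_context(sample, token_limit=8000, reserve=1000))
```

Because the estimate is approximate, it is safer to leave a generous reserve than to rely on the count being exact.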

### Steps to generate data using the SDK:

```python
from ragaai_catalyst import SyntheticDataGeneration

synthetic_data_generation = SyntheticDataGeneration()

# Provide the path to your context file
text_file = "your-context-file-path"
text = synthetic_data_generation.process_document(input_data=text_file)

# Model configuration shared by all three calls
model_config = {
    "provider": "gemini",
    "model": "gemini-1.5-flash",
    "api_base": "your-api-base",
}

# For simple questions
result1 = synthetic_data_generation.generate_qna(
    text, question_type="simple", model_config=model_config, n=20
)
# For complex questions
result2 = synthetic_data_generation.generate_qna(
    text, question_type="complex", model_config=model_config, n=20
)
# For MCQ questions
result3 = synthetic_data_generation.generate_qna(
    text, question_type="mcq", model_config=model_config, n=20
)

# Preview the first few entries of the simple-question results
print(result1.head())
```
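
The three `generate_qna` calls above differ only in `question_type`, so they can be driven by a loop. The sketch below is one way to do that. `generate_all` is a hypothetical helper, not part of the SDK. It takes the generation function as an argument and assumes only the keyword arguments shown in the example above.

```python
from typing import Any, Callable, Dict

# Question type values as used in the SDK example above.
QUESTION_TYPES = ("simple", "complex", "mcq")


def generate_all(
    generate_fn: Callable[..., Any],
    text: Any,
    model_config: Dict[str, str],
    n: int = 20,
) -> Dict[str, Any]:
    """Call generate_fn once per question type and collect the results by type."""
    results: Dict[str, Any] = {}
    for question_type in QUESTION_TYPES:
        results[question_type] = generate_fn(
            text, question_type=question_type, model_config=model_config, n=n
        )
    return results
```

With the SDK, this would be called as `generate_all(synthetic_data_generation.generate_qna, text, model_config)`, where `model_config` is the same dictionary passed in the calls above. The result for each type is then available under its key, for example `results["mcq"]`.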

This feature reduces the manual effort required to create and test datasets, speeds up the development and evaluation cycle for LLMs, and keeps the datasets aligned with the user’s goals.
