# RAG Dataset

### Via UI:

<figure><img src="https://1811327582-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FYbIiNdp1QbG4avl7VShw%2Fuploads%2FQzYdeHGtsLBEqEYF8Pic%2Fdataset1.gif?alt=media&#x26;token=b923762b-930a-4247-8b22-507936fab90a" alt=""><figcaption><p>Upload Via CSV</p></figcaption></figure>

1. Open your **Project** from the Project list.
2. It will take you to the **Dataset** tab.
3. From the options to create a new dataset, select the **"Upload via CSV"** method.
4. Click on the upload area and browse/drag and drop your local CSV file. Ensure the file size does not exceed 1GB.
5. Enter a suitable name and description (optional) for your dataset.
6. Click **Next** to proceed.

Next, you will be directed to map your dataset's columns to Catalyst's inbuilt schema, so your column headings don't need to be edited beforehand.

Here is a list of Catalyst's inbuilt schema elements (definitions are for reference purposes and may vary slightly based on your use case):

<table><thead><tr><th width="210">Schema Element</th><th>Definition</th></tr></thead><tbody><tr><td>traceId</td><td>Unique ID associated with a trace</td></tr><tr><td>metadata</td><td>Any additional data not falling into a defined bucket. User has to define the type of metadata [numerical or categorical]</td></tr><tr><td>cost</td><td>Expense associated with generating a particular inference</td></tr><tr><td>expected_context</td><td>Context documents expected to be retrieved for a query</td></tr><tr><td>latency</td><td>Time taken for an inference to be returned</td></tr><tr><td>system_prompt</td><td>Predefined instruction provided to an LLM to shape its behaviour during interactions</td></tr><tr><td>traceUri</td><td>Unique identifier used to trace and log the sequence of operations during an LLM inference process</td></tr><tr><td>pipeline</td><td>Sequence of processes or stages that an input passes through before producing an output in LLM systems</td></tr><tr><td>response</td><td>Output generated by an LLM after processing a given prompt or query</td></tr><tr><td>context</td><td>Surrounding information or history provided to an LLM to inform and influence its responses</td></tr><tr><td>prompt</td><td>Input or query provided to an LLM that triggers the generation of a response</td></tr><tr><td>expected_response</td><td>Anticipated or ideal output that an LLM should produce in response to a given prompt</td></tr><tr><td>timestamp</td><td>Specific date and time at which an LLM action, such as an inference or a response, occurs</td></tr></tbody></table>
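As a quick illustration, a CSV whose columns already use some of the schema element names above needs no remapping at upload time. The sketch below builds such a file with pandas; the file name and row values are hypothetical:

```python
import pandas as pd

# Hypothetical rows using a subset of Catalyst's schema elements as column names
rows = [
    {
        "prompt": "What is the capital of France?",
        "context": "France is a country in Western Europe. Its capital is Paris.",
        "response": "The capital of France is Paris.",
        "expected_response": "Paris",
    },
]
df = pd.DataFrame(rows)
df.to_csv("rag_dataset.csv", index=False)
print(df.columns.tolist())
```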

### Via SDK:

This guide provides a step-by-step explanation of how to use the RagaAI Python SDK to upload data to your project. The example demonstrates how to manage datasets and upload a CSV file into the platform. The following sections cover initialisation, listing existing datasets, mapping the schema, and uploading the CSV data.

#### 1. Prerequisites

* Ensure you have the RagaAI Python SDK installed. If not, you can install it using:

  ```bash
  pip install ragaai-catalyst
  ```
* You need a secret key, an access key, and your project name, all of which you can obtain by navigating to **Settings > Authenticate** in the UI.
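One common pattern is to expose the keys as environment variables before initialising the SDK. The exact variable names below are assumptions, so verify them against the SDK's authentication documentation before relying on them:

```python
import os

# Assumed environment variable names - check the SDK's auth docs; the
# key values here are placeholders, not real credentials
os.environ["RAGAAI_CATALYST_ACCESS_KEY"] = "your-access-key"
os.environ["RAGAAI_CATALYST_SECRET_KEY"] = "your-secret-key"

print("Credentials configured:", "RAGAAI_CATALYST_ACCESS_KEY" in os.environ)
```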

#### 2. Importing Required Modules

Import the `Dataset` module from the `ragaai_catalyst` library to handle the dataset operations.

```python
from ragaai_catalyst import Dataset
import pandas as pd
```

#### 3. Initialise Dataset Management

Initialise the dataset manager for a specific project. This will allow you to interact with the datasets in that project.

```python
# Initialize Dataset management for a specific project
dataset_manager = Dataset(project_name="demo_project")
```

Replace `"demo_project"` with your actual project name.

#### 4. List Existing Datasets

You can list all the existing datasets within your project to check what data is already available.

```python
# List existing datasets
datasets = dataset_manager.list_datasets()
print("Existing Datasets:", datasets)
```

This prints a list of existing datasets available in your project.

#### 5. Get the Schema Elements

Retrieve the supported schema elements from the project. This will help you understand how to map your CSV columns to the dataset schema.

```python
# Get the schema elements
schemaElements = dataset_manager.get_csv_schema()['data']['schemaElements']
print('Supported column names: ', schemaElements)
```

This step returns the available schema elements that can be used for mapping your CSV columns.

#### 6. Create the Schema Mapping

Create a dictionary to map your CSV column names to the schema elements supported by RagaAI. For example:

```python
# Create the schema mapping accordingly
schema_mapping = {'sql_context': 'context', 'sql_prompt': 'prompt'}
```

In this case, the column `'sql_context'` in the CSV is mapped to `'context'` in the dataset, and `'sql_prompt'` is mapped to `'prompt'`.
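Before uploading, it can help to sanity-check that every target in `schema_mapping` points at one of the schema elements retrieved in step 5. Here is a small helper sketch; the helper name and the sample element list are illustrative, so in practice pass the real output of `get_csv_schema()`:

```python
def find_unsupported_targets(schema_mapping, schema_elements):
    """Return mapping targets that are not supported schema elements."""
    return [target for target in schema_mapping.values()
            if target not in schema_elements]

# Illustrative values; use the real schemaElements from step 5 in practice
supported = ["prompt", "context", "response", "expected_response"]
schema_mapping = {"sql_context": "context", "sql_prompt": "prompt"}

unknown = find_unsupported_targets(schema_mapping, supported)
if unknown:
    raise ValueError(f"Unsupported schema targets: {unknown}")
print("Schema mapping is valid")
```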

#### 7. Upload the Dataset from CSV

Finally, use the `create_from_csv` function to upload the CSV data into the platform. Specify the CSV path, dataset name, and the schema mapping.

```python
# Create a dataset from CSV
dataset_manager.create_from_csv(
    csv_path='/content/synthetic_text_to_sql_gpt_4o_mini.csv',
    dataset_name='csv_upload31',
    schema_mapping=schema_mapping
)
```

Replace the `csv_path` and `dataset_name` with your CSV file path and desired dataset name, respectively.

#### 8. Verifying the Upload

After uploading, you can verify the upload by listing the datasets again or checking the project dashboard.

```python
# List datasets to verify the upload
datasets = dataset_manager.list_datasets()
print("Updated Datasets:", datasets)
```
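To automate that check, you could assert that the new dataset name now appears in the listing. The sketch below uses illustrative names and assumes `list_datasets()` yields dataset names; adjust if the returned shape differs:

```python
def dataset_uploaded(dataset_name, datasets):
    """Check whether a dataset name appears in the listing."""
    return dataset_name in datasets

# Illustrative listing; in practice use dataset_manager.list_datasets()
datasets = ["baseline_eval", "csv_upload31"]
assert dataset_uploaded("csv_upload31", datasets)
print("Upload verified")
```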

#### 9. Exploring the Dataset in the UI

Navigate to the **Dataset** tab inside your project to explore your dataset and run evaluations.

<figure><img src="https://1811327582-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FYbIiNdp1QbG4avl7VShw%2Fuploads%2FiDp4FpdIWeeFby5ezz55%2FScreenshot%202024-09-30%20at%203.57.47%E2%80%AFPM.png?alt=media&#x26;token=135abcd6-d094-46d5-af4a-9fe7f3d0f3fe" alt=""><figcaption><p>Uploaded Dataset</p></figcaption></figure>
