# Self-Hosting RagaAI Catalyst on Kubernetes

This guide walks you through deploying the Catalyst platform to an existing Kubernetes cluster with Helm. It covers Azure Kubernetes Service (AKS), Google Kubernetes Engine (GKE), and Amazon Elastic Kubernetes Service (EKS).

### Supported Kubernetes Distributions

Catalyst has been successfully tested on the following Kubernetes distributions:

* Azure Kubernetes Service (AKS)
* Google Kubernetes Engine (GKE)
* Amazon Elastic Kubernetes Service (EKS)

### Prerequisites

Ensure the following tools and resources are ready:

* A working Kubernetes cluster (AKS, GKE, or EKS) accessible via `kubectl`, meeting these minimum requirements:
  * Kubernetes version 1.28 or higher
  * At least 3 nodes, each with:
    * 8 vCPUs
    * 16 GiB RAM
  * Recommended: Use a cluster autoscaler to dynamically scale nodes based on resource usage
  * Recommended: Install the metrics server to enable autoscaling
  * Catalyst uses Elasticsearch, Kibana, and Redis (for caching), all of which require persistent storage.
  * Verify storage class availability by running:

    ```plaintext
    kubectl get storageclass
    ```

    Example output:

    ```plaintext
    NAME                   PROVISIONER                RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
    default (default)      disk.csi.azure.com         Delete          WaitForFirstConsumer   true                   120d
    ```
* Helm
  * Install Helm version 3.13 or higher.
  * See the [official Helm documentation](https://helm.sh/docs/intro/install/) for instructions
* Docker Personal Access Token (PAT)
  * Obtain this from your RagaAI representative.
  * Contact <support@ragaai.com> for details
* Ingress
  * Nginx Ingress is recommended for managing external traffic to Catalyst services
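
The version minimums listed above can be checked from a shell before installing. The following is a minimal sketch, assuming GNU `sort` (for its `-V` version ordering) is available; `version_ge` and `check_min` are hypothetical helper names, not part of Catalyst or Helm.

```shell
# version_ge A B: succeeds when version A is greater than or equal to version B.
# Relies on GNU sort's -V (version sort) option.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_min NAME ACTUAL MINIMUM: prints a one-line verdict for a tool version.
check_min() {
  actual="${2#v}"   # strip a leading "v", as in "v3.13.1"
  if version_ge "$actual" "$3"; then
    echo "OK: $1 $actual >= $3"
  else
    echo "FAIL: $1 $actual < $3"
    return 1
  fi
}

# Usage (requires helm on PATH):
#   check_min helm "$(helm version --template '{{.Version}}')" 3.13
```

The same helper can be pointed at the Kubernetes version reported in the `VERSION` column of `kubectl get nodes`.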

#### Cloud Setup

> **Note:** Connections between your Kubernetes cluster and cloud resources are secured with cloud-native identity mechanisms:
>
> * **AWS:** IAM Roles for Service Accounts (IRSA)
> * **Azure:** Federated Identity Credential
> * **GCP:** Workload Identity

**AWS**

* **S3 Bucket Setup**:
  1. Create an S3 Bucket for object storage.
  2. Set up IRSA (IAM Roles for Service Accounts)
     1. Role Name: `raga-role`
     2. Permissions: Access to the S3 bucket
     3. Trust relationship: EKS OIDC provider
     4. Service account: `system:serviceaccount:raga:raga-role`
  3. Configure CORS settings:
     * Allowed Methods: GET, PUT
     * Allowed Origins: \*
     * Allowed Headers: \*
     * Exposed Headers: none
     * Max Age: 3000 seconds
* **Database Setup**:
  * Database Version: MySQL 8.0 or later
  * Network Configuration: Deploy the database within the same VPC as your EKS cluster using a private endpoint
  * Storage: At least 50GB of SSD storage with automatic storage scaling enabled
  * Connectivity: Ensure the EKS nodes can reach the MySQL instance.
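
The CORS settings listed for the S3 bucket above could be applied with the AWS CLI roughly as follows. This is a sketch rather than a required procedure: the file name `s3-cors.json` and the `<bucket-name>` placeholder are assumptions, and `ExposeHeaders` is left out because the settings call for none.

```shell
# Write the CORS rules from the list above to a local file.
cat > s3-cors.json <<'EOF'
{
  "CORSRules": [
    {
      "AllowedMethods": ["GET", "PUT"],
      "AllowedOrigins": ["*"],
      "AllowedHeaders": ["*"],
      "MaxAgeSeconds": 3000
    }
  ]
}
EOF

# Apply it (requires the AWS CLI and credentials; <bucket-name> is a placeholder):
#   aws s3api put-bucket-cors --bucket <bucket-name> \
#     --cors-configuration file://s3-cors.json
```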

**Azure**

* **Azure Blob Storage Setup**:
  1. Create an Azure Blob Storage Account.
  2. Create an Azure Storage Container within the Blob Storage Account.
  3. Configure CORS:
     * Allowed Methods: GET, PUT
     * Allowed Origins: \*
     * Allowed Headers: \*
  4. Set up Federated Identity Credential (Azure AD Workload Identity) for secure access from Kubernetes:
     * Assign the following roles to the Service Principal for the storage account:
       * Storage Blob Data Contributor
     * Kubernetes Service Account: `raga-role`
  5. Enable Azure Workload Identity in your AKS cluster.
* **Database (MySQL) Requirements:**
  * Database Version: MySQL 8.0 or later
  * Network Configuration: Deploy within the same VNet as your AKS cluster using a private endpoint
  * Storage: At least 50GB SSD storage with automatic storage increase enabled
  * Connectivity: Ensure the AKS nodes can reach the Azure Database for MySQL server
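
The Blob Storage CORS settings above might be applied with the Azure CLI along these lines. Treat it as a sketch: `set_blob_cors` is a hypothetical wrapper name, and depending on how your CLI session authenticates to the storage account you may need additional authentication options.

```shell
# set_blob_cors ACCOUNT: adds a CORS rule to the Blob service ("b")
# of the given storage account, allowing GET and PUT from any origin.
set_blob_cors() {
  az storage cors add \
    --account-name "$1" \
    --services b \
    --methods GET PUT \
    --origins '*' \
    --allowed-headers '*'
}

# Usage (requires the Azure CLI and a logged-in session):
#   set_blob_cors <storage-account-name>
```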

**GCP**

* **Google Cloud Storage Setup**:
  1. Create a GCS Bucket for object storage.
  2. Configure CORS with the following settings:
     * Allowed Methods: GET, PUT
     * Allowed Origins: \* (all origins)
     * Allowed Headers: \* (all headers)
  3. Set up Workload Identity to access this bucket from GKE:
     * Create a Google Service Account and grant it the following roles:
       * roles/storage.admin
       * roles/storage.objectAdmin
       * roles/iam.serviceAccountTokenCreator
     * GKE Namespace: `raga`
     * GKE Service Account: `raga-role`
     * Bind the GKE service account to the Google service account using Workload Identity.
  4. Enable Workload Identity and the GCE Persistent Disk CSI Driver in your GKE cluster.
* **Database (MySQL) Requirements:**
  * Database Version: MySQL 8.0 or later
  * Network Configuration: Deploy within the same VPC as your GKE cluster using a private IP
  * Storage: At least 50GB SSD storage with automatic storage increase enabled
  * Set `SSL_mode = "ALLOW_UNENCRYPTED_AND_ENCRYPTED"`
  * Connectivity: Ensure the GKE nodes can reach the Cloud SQL instance.
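
Binding the GKE service account to the Google service account, as described in the Workload Identity step above, typically looks like the sketch below. `wi_member` is a hypothetical helper, and `<project-id>` and `<gsa-email>` are placeholders. If the Catalyst chart creates and annotates the `raga-role` service account itself, the `kubectl annotate` step may be unnecessary.

```shell
# wi_member PROJECT NAMESPACE KSA: prints the Workload Identity member string
# that identifies a Kubernetes service account to Google Cloud IAM.
wi_member() {
  printf 'serviceAccount:%s.svc.id.goog[%s/%s]\n' "$1" "$2" "$3"
}

# Usage (requires gcloud and kubectl; <project-id> and <gsa-email> are placeholders):
#   gcloud iam service-accounts add-iam-policy-binding <gsa-email> \
#     --role roles/iam.workloadIdentityUser \
#     --member "$(wi_member <project-id> raga raga-role)"
#   kubectl annotate serviceaccount raga-role -n raga \
#     iam.gke.io/gcp-service-account=<gsa-email>
```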

### Configuration

#### Docker Hub Access

To deploy Catalyst, you must configure access to private Docker Hub repositories hosted by RagaAI.

1. **Obtain Docker PAT**:
   * Contact your RagaAI representative at <support@ragaai.com> to obtain a Docker Hub Personal Access Token (PAT).
2. **Log in to Docker Hub**:
   * Use the provided PAT to authenticate with Docker Hub:

     ```plaintext
     docker login -u ragaai -p <docker-pat>
     ```
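
Passing the PAT with `-p` leaves it in shell history and process listings. As an alternative sketch, `docker login` accepts the token on standard input via `--password-stdin`. The `DOCKER_PAT` environment variable and the `docker_login_stdin` wrapper are assumptions made for this example.

```shell
# docker_login_stdin: reads the PAT from the DOCKER_PAT environment variable
# and passes it to docker on stdin instead of on the command line.
docker_login_stdin() {
  if [ -z "${DOCKER_PAT:-}" ]; then
    echo "DOCKER_PAT is not set" >&2
    return 1
  fi
  printf '%s' "$DOCKER_PAT" | docker login -u ragaai --password-stdin
}

# Usage:
#   export DOCKER_PAT=<docker-pat>
#   docker_login_stdin
```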

#### Firewall Rules

* **Inbound Ports**:
  * Port 80 (HTTP): Required at the Load Balancer for accessing APIs and UI
* **Outbound Ports**:
  * Port 443 (HTTPS): Required if connecting to public LLMs (e.g., OpenAI, Anthropic). If outbound access is not allowed, deploy local models within the network instead
  * SMTP (Optional): Required for email alerts

### Deploying to Kubernetes

1. **Networking & Traffic Management**
   * Ensure the Nginx Ingress Controller pods are up and running to manage external traffic. Verify by running:

     ```plaintext
     kubectl get pods -n ingress-nginx
     ```

     Refer to the [Nginx Ingress installation instructions](https://kubernetes.github.io/ingress-nginx/deploy/) for details on setup and troubleshooting.
2. Deploy the Catalyst initialization Helm chart:

   ```plaintext
   helm install raga-init oci://registry-1.docker.io/ragaai/raga-init \
     --version 0.1.0 \
     --set dockerpat=<docker-pat>
   ```

   * Successful output example:

     ```plaintext
     NAME: raga-init
     LAST DEPLOYED: Thu Jun 26 15:34:00 2025
     NAMESPACE: default
     STATUS: deployed
     REVISION: 1
     TEST SUITE: None
     ```
   * Verify:

     ```plaintext
     kubectl get ns raga
     kubectl get secret regcred -n raga
     ```
3. Deploy the Catalyst Helm chart:

   ```plaintext
   helm install raga-catalyst oci://registry-1.docker.io/ragaai/raga-catalyst \
     --version 0.1.0 \
     -n raga \
     --set releaseTag=<release-tag> \
     --set storageClass=<storage-class> \
     --set endpoint=<http://loadbalancer-endpoint> \
     --set mysql.host=<mysql-host> \
     --set mysql.user=<mysql-user> \
     --set mysql.password=<mysql-password>
   ```

   * Replace `<mysql-host>`, `<mysql-user>`, and `<mysql-password>` with your MySQL instance details.
   * These parameters are required for connecting Catalyst to your external MySQL database.

| AWS Parameters             | Azure Parameters                | GCP Parameters          |
| -------------------------- | ------------------------------- | ----------------------- |
| `EKSclusterName`           | `AzureBlobStorageName`          | `GcpServiceAccountName` |
| `ClusterAutoscalerRoleARN` | `AzureBlobStorageContainerName` | `GcsBucketname`         |
| `AWSRoleARN`               | `azWorkloadIdentity.tenantId`   |                         |
| `S3BucketName`             | `azWorkloadIdentity.clientId`   |                         |
| `AWSRegion`                |                                 |                         |
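
As an illustration, the AWS parameters from the table could be collected in a values file instead of individual `--set` flags. This sketch assumes the names in the table are top-level chart values, which the chart itself should confirm. The file name and every value shown are placeholders.

```shell
# Hypothetical values for an AWS install; replace every value with your own.
cat > catalyst-aws-values.yaml <<'EOF'
EKSclusterName: my-eks-cluster
ClusterAutoscalerRoleARN: arn:aws:iam::123456789012:role/cluster-autoscaler
AWSRoleARN: arn:aws:iam::123456789012:role/raga-role
S3BucketName: my-catalyst-bucket
AWSRegion: us-east-1
EOF

# Then pass the file to the install command from step 3, keeping its
# other --set flags (releaseTag, storageClass, endpoint, mysql.*):
#   helm install raga-catalyst oci://registry-1.docker.io/ragaai/raga-catalyst \
#     --version 0.1.0 -n raga -f catalyst-aws-values.yaml \
#     --set releaseTag=<release-tag>
```

For Azure, the dotted names would nest in YAML, for example `tenantId` under an `azWorkloadIdentity` key.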

* Set only the parameters for your own cloud provider when running the Helm installation.
* Successful output example:

  ```plaintext
  NAME: raga-catalyst
  LAST DEPLOYED: Thu Jun 26 15:35:00 2025
  NAMESPACE: raga
  STATUS: deployed
  REVISION: 1
  TEST SUITE: None
  ```
* It may take a few minutes to create Kubernetes resources and initialize services
* Check pods:

  ```plaintext
  kubectl get pods -n raga
  ```

  Example output:

  ```plaintext
  litellm-76bd8cdd67-4brtw                          1/1     Running   0   22h
  llm-data-loader-5d4858fcc5-n6qj8                  1/1     Running   0   22h
  llm-platform-api-79d44b7b6d-hsfdz                 1/1     Running   0   61m
  llm-platform-esservice-74b4cf4876-gl7bf           1/1     Running   0   22h
  llm-platform-operators-869959f965-gdpjc           1/1     Running   0   22h
  llm-platform-raga-catalyst-sdk-66f5bcc494-fwxfs   1/1     Running   0   22h
  llm-platform-status-updater-7bd749b98f-tn55b      1/1     Running   0   22h
  llm-platform-ui-784c69c459-wqq26                  1/1     Running   0   7h20m
  ```
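
To spot pods that have not reached `Running`, the listing above can be filtered. A minimal sketch, where `not_running` is a hypothetical helper; note that it also reports pods in `Completed` state, if the release creates any.

```shell
# not_running: reads `kubectl get pods --no-headers` output on stdin and
# prints the names of pods whose STATUS column is not "Running".
not_running() {
  awk '$3 != "Running" { print $1 }'
}

# Usage (requires kubectl access to the cluster):
#   kubectl get pods -n raga --no-headers | not_running
#
# Or block until every pod reports Ready (up to 10 minutes):
#   kubectl wait --for=condition=Ready pods --all -n raga --timeout=600s
```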

### Validate your Deployment

1. Run:

   ```plaintext
   kubectl get services -n raga
   ```

   Example output:

   ```plaintext
   litellm                                  ClusterIP   10.103.99.23     <none>        80/TCP                          22h
   llm-data-loader                          ClusterIP   10.96.198.238    <none>        80/TCP                          22h
   llm-platform-api                         ClusterIP   10.99.5.206      <none>        80/TCP                          64m
   llm-platform-api-nodeport                NodePort    10.109.130.90    <none>        80:31200/TCP                    64m
   llm-platform-esservice                   ClusterIP   10.96.80.164     <none>        80/TCP                          22h
   llm-platform-operators                   ClusterIP   10.98.17.72      <none>        80/TCP                          22h
   llm-platform-raga-catalyst-sdk           ClusterIP   10.111.108.37    <none>        80/TCP                          22h
   llm-platform-status-updater              ClusterIP   10.107.89.124    <none>        80/TCP                          22h
   llm-platform-ui                          ClusterIP   10.105.47.0      <none>        80/TCP                          7h23m
   ```
2. Access the platform using the external IP of the `raga-catalyst-frontend` service:

   ```plaintext
   curl <external-ip>/api/healthcheck
   ```

   Expected output:

   ```plaintext
   {"status":"healthy"}
   ```
3. Visit the external IP in your browser to confirm the Catalyst UI is operational
   * Example: `http://<external-ip>`
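
Because services can take a few minutes to initialize, the health check above may need to be retried. The sketch below polls the endpoint until it returns the expected body. `is_healthy` and `wait_healthy` are hypothetical helpers, and the 10-second interval and 30-attempt default are arbitrary choices.

```shell
# is_healthy: succeeds when the JSON on stdin contains "status":"healthy".
is_healthy() {
  tr -d '[:space:]' | grep -q '"status":"healthy"'
}

# wait_healthy URL [ATTEMPTS]: polls URL every 10 seconds until it is healthy
# or the attempt limit (default 30) is reached.
wait_healthy() {
  attempts="${2:-30}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    if curl -fsS "$1" 2>/dev/null | is_healthy; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep 10
    i=$((i + 1))
  done
  echo "not healthy after $attempts attempts" >&2
  return 1
}

# Usage:
#   wait_healthy "http://<external-ip>/api/healthcheck"
```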

### Final Notes

* Ensure proper IAM permissions for your cloud provider's storage, Kubernetes service, and Helm deployments
* Monitor cluster health using ELK and `kubectl logs`
* Check Helm release status:

  ```plaintext
  helm list -n raga
  helm status raga-catalyst -n raga
  ```
* For issues, contact the RagaAI team at <support@ragaai.com>
