NVIDIA Run:ai#
NVIDIA Run:ai accelerates AI and machine learning operations by addressing key infrastructure challenges through dynamic resource allocation, comprehensive AI life-cycle support, and strategic resource management. By pooling resources across environments and utilizing advanced orchestration, NVIDIA Run:ai significantly enhances GPU efficiency and workload capacity. With support for public clouds, private clouds, hybrid environments, or on-premises data centers, NVIDIA Run:ai provides unparalleled flexibility and adaptability.
When deployed via k0rdent, NVIDIA Run:ai bootstraps fully automatically together with all necessary prerequisites. In addition, it can be configured to use Let's Encrypt to issue certificates automatically. Afterwards, you still need to configure DNS.
Note!
You need to bring your own clustertemplate that fulfills all necessary requirements, such as node labeling. You'll find an example template that you can use in the deployment descriptions below.
NVIDIA Run:ai prerequisites:
- Istio or Cilium (the example clustertemplate uses Cilium, and we test with the Cilium CNI)
- k0rdent 1.4.0+
- NVIDIA Run:ai license
- External-DNS
- NVIDIA GPU Operator
- cert-manager
- NGINX Ingress
- Knative Operator
- Prometheus Stack
- CSI driver & StorageClass with ReadWriteMany capabilities (the AWS EBS CSI is fine for a PoC)
- Node labels:
  - GPU: runai-gpu-worker
  - System: runai-system
  - CPU: runai-cpu-worker
Note!
The Knative operator enables the deployment and configuration of Knative Serving.
Host-based routing (subdomain delegation to inference endpoints) requires a wildcard certificate. In this example deployment we use cert-manager with Let's Encrypt and a DNS01 challenge.
For further requirements, please refer to the official NVIDIA Run:ai documentation.
Tested on:
| Run:ai version | Provider | Kubernetes | GPU | k0rdent | OS | Mirantis Meta Helm chart |
|---|---|---|---|---|---|---|
| 2.22.x | AWS | v1.32, v1.33 | L40s | 1.4.0 | Ubuntu 22.04 | 0.2.3 |
| 2.23.x | AWS | v1.33, v1.34 | L40s | 1.4.0, 1.5.0 | Ubuntu 22.04 | 0.3.0 |
Install template to k0rdent#
helm upgrade --install runai-control-plane oci://registry.mirantis.com/k0rdent-enterprise-catalog/kgst --set "chart=runai-control-plane:0.3.0" -n kcm-system
helm upgrade --install external-dns oci://ghcr.io/k0rdent/catalog/charts/kgst --set "chart=external-dns:1.19.0" -n kcm-system
helm upgrade --install gpu-operator oci://ghcr.io/k0rdent/catalog/charts/kgst --set "chart=gpu-operator:25.10.0" -n kcm-system
helm upgrade --install ingress-nginx oci://ghcr.io/k0rdent/catalog/charts/kgst --set "chart=ingress-nginx:4.13.2" -n kcm-system
helm upgrade --install cert-manager oci://ghcr.io/k0rdent/catalog/charts/kgst --set "chart=cert-manager:1.18.2" -n kcm-system
helm upgrade --install knative-operator oci://ghcr.io/k0rdent/catalog/charts/kgst --set "chart=knative-operator:1.19.0" -n kcm-system
helm upgrade --install kube-prometheus-stack oci://ghcr.io/k0rdent/catalog/charts/kgst --set "chart=kube-prometheus-stack:77.0.0" -n kcm-system
Verify service template#
kubectl get servicetemplates -A
# NAMESPACE NAME VALID
# kcm-system external-dns-1-19-0 true
# kcm-system runai-control-plane-0-3-0 true
# kcm-system gpu-operator-25-10-0 true
# kcm-system ingress-nginx-4-13-2 true
# kcm-system cert-manager-1-18-2 true
# kcm-system knative-operator-1-19-0 true
# kcm-system kube-prometheus-stack-77-0-0 true
Deploy service template#
Run:ai config#
kind: ConfigMap
apiVersion: v1
metadata:
  name: runai-control-plane
  namespace: kcm-system
data:
  runai-values: |
    license: <BASE64 encoded Run:ai license>
    clusterIssuer:
      name: letsencrypt-prod
      server: https://acme-v02.api.letsencrypt.org/directory
      privateKeySecretRef:
        name: letsencrypt-prod
      solvers:
        - dns01:
            route53:
              region: eu-north-1
              accessKeyIDSecretRef:
                name: prod-route53-credentials-secret # Secret you need to create upfront on the child cluster or via ServiceTemplate and secret delegation
                key: AccessKeyID
              secretAccessKeySecretRef:
                name: prod-route53-credentials-secret # Secret you need to create upfront on the child cluster or via ServiceTemplate and secret delegation
                key: SecretAccessKey
    control-plane:
      global:
        domain: runai.example.com
        ingress:
          extraAnnotations:
            cert-manager.io/cluster-issuer: letsencrypt-prod
      tenantsManager:
        config:
          adminUsername: "test@run.ai" # It is highly recommended to change this
          adminPassword: "Abcd!234" # It is highly recommended to change this
    knative:
      ingress:
        className: nginx
      domain:
        runai.example.com: ""
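All ConfigMaps and Secrets in this section are created on the management cluster. Assuming the manifest above is saved as runai-control-plane-values.yaml (the filename is arbitrary), apply it with:
kubectl apply -f runai-control-plane-values.yaml
The same pattern applies to the remaining ConfigMaps and Secrets below.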
NVIDIA GPU Operator config#
kind: ConfigMap
apiVersion: v1
metadata:
  name: service-gpu-operator-values
  namespace: kcm-system
data:
  values: |
    gpu-operator:
      operator:
        defaultRuntime: containerd
      toolkit:
        env:
          - name: CONTAINERD_CONFIG
            value: /run/k0s/containerd-cri.toml
          - name: RUNTIME_DROP_IN_CONFIG
            value: /etc/k0s/containerd.d/nvidia.toml
          - name: CONTAINERD_SOCKET
            value: /run/k0s/containerd.sock
          - name: CONTAINERD_RUNTIME_CLASS
            value: nvidia
      # optionally, create a DCGM-Exporter ServiceMonitor
      dcgmExporter:
        serviceMonitor:
          enabled: true
          interval: 15s
          honorLabels: true
          additionalLabels: {}
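Once the MultiClusterService further below has rolled the GPU operator out to a child cluster, a quick sanity check against the child cluster could look like this (the nvidia.com/gpu label is the one set in the clusterdeployment example at the end of this page):
kubectl get pods -n gpu-operator
kubectl get nodes -l nvidia.com/gpu=true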
External-DNS AWS secret#
kind: Secret
apiVersion: v1
metadata:
  name: service-external-dns-values
  namespace: kcm-system
type: addons.projectsveltos.io/cluster-profile # This is necessary for Sveltos secret delegation
stringData:
  values: |
    provider:
      name: aws
    env:
      - name: AWS_REGION
        value: 'AWS_REGION' # Change me
      - name: AWS_ACCESS_KEY_ID
        value: 'AWS_ACCESS_KEY_ID' # Change me
      - name: AWS_SECRET_ACCESS_KEY
        value: '' # Change me
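To confirm after deployment that external-dns can manage your Route53 zone, you can check its logs on the child cluster; the deployment name assumes the release name external-dns used in the MultiClusterService below:
kubectl logs -n external-dns deploy/external-dns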
Cert-Manager AWS secret#
apiVersion: v1
kind: Secret
metadata:
  name: cert-manager-aws-secret
  namespace: kcm-system
type: addons.projectsveltos.io/cluster-profile
stringData:
  secret.yaml: |
    apiVersion: v1
    kind: Secret
    metadata:
      name: prod-route53-credentials-secret
      namespace: cert-manager
    type: Opaque
    stringData:
      AccessKeyID: '' # Change me
      SecretAccessKey: '' # Change me
---
apiVersion: k0rdent.mirantis.com/v1beta1
kind: ServiceTemplate
metadata:
  name: cert-manager-aws-secret
  namespace: kcm-system
spec:
  resources:
    localSourceRef:
      kind: Secret
      name: cert-manager-aws-secret
    deploymentType: Remote
    path: "" # will be ignored
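Once the MultiClusterService has applied this service template, the delegated Secret should exist on the child cluster, where the ClusterIssuer from the Run:ai values references it by name:
kubectl get secret prod-route53-credentials-secret -n cert-manager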
MultiClusterService#
apiVersion: k0rdent.mirantis.com/v1beta1
kind: MultiClusterService
metadata:
  name: runai-control-plane
spec:
  clusterSelector:
    matchLabels:
      group: runai-cp
  serviceSpec:
    services:
      - name: gpu-operator
        namespace: gpu-operator
        template: gpu-operator-25-10-0
        valuesFrom:
          - kind: ConfigMap
            name: service-gpu-operator-values
      - name: nginx
        namespace: ingress-nginx
        template: ingress-nginx-4-13-2
      - name: external-dns
        namespace: external-dns
        template: external-dns-1-19-0
        valuesFrom:
          - kind: Secret
            name: service-external-dns-values
      - name: prod-route53-credentials-secret
        namespace: cert-manager
        template: cert-manager-aws-secret
      - name: cert-manager
        namespace: cert-manager
        template: cert-manager-1-18-2
        values: |
          cert-manager:
            crds:
              enabled: true
      - name: prometheus
        namespace: monitoring
        template: kube-prometheus-stack-77-0-0
        values: |
          grafana:
            enabled: false
      - name: knative-operator
        namespace: knative-operator
        template: knative-operator-1-19-0
      - name: runai-backend
        namespace: runai-backend
        template: runai-control-plane-0-3-0
        valuesFrom:
          - kind: ConfigMap
            name: runai-control-plane
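To follow the rollout from the management cluster, a minimal sketch (the Sveltos ClusterSummary resources report the per-service deployment status):
kubectl get multiclusterservices
kubectl get clustersummaries -A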
Please Note!
The Helm chart that deploys NVIDIA Run:ai is only a meta Helm chart that depends on the upstream NVIDIA Run:ai Helm chart. It requires the CRDs of the other services listed above and cannot be installed on its own. The dependencies are listed in the MultiClusterService, not in the chart itself!
Install template to k0rdent#
This example clustertemplate uses the Cilium CNI and therefore requires the Cilium service template, installed below.
helm upgrade --install cilium oci://ghcr.io/k0rdent/catalog/charts/kgst --set "chart=cilium:1.18.2" -n kcm-system
Verify service template#
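As before, the new service template should be listed as valid:
kubectl get servicetemplates -A | grep cilium
# kcm-system   cilium-1-18-2   true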
Example clustertemplate#
This clustertemplate is only an example; you can create your own based on your needs. It is meant simply to speed up your deployment.
Clustertemplate HelmRepository#
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  labels:
    k0rdent.mirantis.com/managed: 'true'
  name: mirantis-templates
  namespace: kcm-system
spec:
  interval: 1m0s
  type: oci
  url: oci://registry.mirantis.com/k0rdent-ai/example/charts
Clustertemplate#
apiVersion: k0rdent.mirantis.com/v1beta1
kind: ClusterTemplate
metadata:
  name: aws-generic-cilium-clustertemplate-0-1-0
  namespace: kcm-system
spec:
  helm:
    chartSpec:
      chart: aws-cp-standalone-generic-cilium
      interval: 10m0s
      sourceRef:
        kind: HelmRepository
        name: mirantis-templates
      version: 0.1.0
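You can check that the clustertemplate has been created and passes validation; the VALID column is expected to behave like the service template listing above:
kubectl get clustertemplates -n kcm-system
# NAME                                       VALID
# aws-generic-cilium-clustertemplate-0-1-0   true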
Cilium config values#
kind: ConfigMap
apiVersion: v1
metadata:
  name: service-cilium-values
  namespace: kcm-system
data:
  cilium-values: |
    cilium:
      cluster:
        name: cilium
      hubble:
        tls:
          enabled: false
          auto:
            method: helm
            certManagerIssuerRef: {}
        ui:
          enabled: false
          ingress:
            enabled: false
        relay:
          enabled: false
      ipv4:
        enabled: true
      ipv6:
        enabled: false
      envoy:
        enabled: false
      egressGateway:
        enabled: false
      kubeProxyReplacement: "true"
      serviceAccounts:
        cilium:
          name: cilium
        operator:
          name: cilium-operator
      localRedirectPolicy: true
      ipam:
        mode: cluster-pool
        operator:
          clusterPoolIPv4PodCIDRList:
            - "192.168.224.0/20"
            - "192.168.210.0/20"
          clusterPoolIPv6PodCIDRList:
            - "fd00::/104"
      tunnelProtocol: geneve
Clusterdeployment example#
Attention!
You need to update all necessary parameters, like machine flavors, key names, etc.
This clusterdeployment has a very small footprint and does not provide any HA. You would need to increase the number of nodes to distribute pods properly.
apiVersion: k0rdent.mirantis.com/v1beta1
kind: ClusterDeployment
metadata:
  name: runai-cp
  namespace: kcm-system
spec:
  ipamClaim: {}
  propagateCredentials: true
  credential: aws-cluster-identity-cred # Update if necessary
  template: aws-generic-cilium-clustertemplate-0-1-0
  config:
    awscluster:
      network:
        cni:
          cniIngressRules:
            - description: vxlan (cilium)
              fromPort: 8472
              protocol: udp
              toPort: 8472
            - description: geneve (cilium)
              fromPort: 6081
              protocol: udp
              toPort: 6081
    clusterLabels:
      group: runai-cp
      type: aws
    bastion:
      enabled: true
      disableIngressRules: false
      allowedCIDRBlocks: []
      instanceType: t3.small
      ami: "ami-0becc523130ac9d5d"
    workers:
      - name: cpu
        type: worker
        preStartCommands:
          - "apt update"
          - "apt install nfs-common -y"
        labels:
          node-role.kubernetes.io/runai-cpu-worker: "true"
        amiID: ami-0becc523130ac9d5d # eu-north-1 - Ubuntu 22.04
        instanceType: t3.2xlarge
        number: 1
        rootVolumeSize: 120
        publicIP: false
        iamInstanceProfile: control-plane.cluster-api-provider-aws.sigs.k8s.io
      - name: system
        type: worker
        preStartCommands:
          - "apt update"
          - "apt install nfs-common -y"
        labels:
          node-role.kubernetes.io/runai-system: "true"
        amiID: ami-0becc523130ac9d5d
        instanceType: t3.2xlarge
        number: 1
        rootVolumeSize: 120
        publicIP: false
        iamInstanceProfile: control-plane.cluster-api-provider-aws.sigs.k8s.io
      - name: worker
        type: gpu
        preStartCommands:
          - "apt update"
          - "apt install nfs-common -y"
        labels:
          node-role.kubernetes.io/runai-gpu-worker: "true"
          node.kubernetes.io/role: runai-gpu-worker
          nvidia.com/gpu: "true"
        amiID: ami-0becc523130ac9d5d
        instanceType: g6e.2xlarge
        iamInstanceProfile: control-plane.cluster-api-provider-aws.sigs.k8s.io
        number: 1
        rootVolumeSize: 120
        publicIP: true
    controlPlane:
      amiID: ami-0becc523130ac9d5d
      instanceType: c5.xlarge
      number: 1
      rootVolumeSize: 100
    k0s:
      version: v1.34.2+k0s.0
      network: # Disable the default network in order to install the Cilium CNI
        calico: null
        provider: custom
        kubeProxy:
          disabled: true
    region: eu-north-1 # Update if necessary
    sshKeyName: runai # Update if necessary
    clusterIdentity:
      name: "aws-cluster-identity"
      kind: "AWSClusterStaticIdentity"
  serviceSpec:
    stopOnConflict: false
    syncMode: Continuous
    continueOnError: true
    priority: 100
    services:
      - name: cilium
        namespace: kube-system
        template: cilium-1-18-2
        valuesFrom:
          - kind: ConfigMap
            name: service-cilium-values
        values: |
          cilium:
            k8sServiceHost: {{ .Cluster.spec.controlPlaneEndpoint.host }}
            k8sServicePort: {{ .Cluster.spec.controlPlaneEndpoint.port }}
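After applying the clusterdeployment, you can follow provisioning from the management cluster and, once the cluster is ready, fetch the child cluster kubeconfig. The secret name and key follow the usual Cluster API convention and are an assumption here:
kubectl get clusterdeployment runai-cp -n kcm-system
kubectl get secret runai-cp-kubeconfig -n kcm-system -o jsonpath='{.data.value}' | base64 -d > runai-cp.kubeconfig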