NVIDIA GPU Operator#
The NVIDIA GPU Operator for Kubernetes simplifies GPU management in Kubernetes clusters. It automates the deployment and configuration of the software components needed to enable GPUs, which makes it easier to run GPU-accelerated workloads such as AI/ML training and high-performance computing.
Prerequisites#
Deploy k0rdent v1.4.0 by following the QuickStart guide.
Install template to k0rdent#
helm upgrade --install gpu-operator oci://ghcr.io/k0rdent/catalog/charts/kgst --set "chart=gpu-operator:24.9.2" -n kcm-system
Verify service template#
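A minimal sketch of one way to check that the template was registered. It assumes that the ServiceTemplate name is the chart name and version joined with dashes, with dots replaced by dashes (which matches the `gpu-operator-24-9-2` name used in this guide), and that the template is created in the `kcm-system` namespace used by the install command. The helper names `template_name` and `verify_service_template` are hypothetical.

```shell
# Assumption: the ServiceTemplate is named "<chart>-<version>" with dots
# replaced by dashes, e.g. gpu-operator:24.9.2 -> gpu-operator-24-9-2.
template_name() {
  # $1 = chart name, $2 = chart version
  printf '%s-%s\n' "$1" "$(printf '%s' "$2" | tr '.' '-')"
}

# Look up the ServiceTemplate in the namespace used by the helm install.
verify_service_template() {
  kubectl get servicetemplate "$(template_name "$1" "$2")" -n kcm-system
}

# Usage (against the management cluster):
#   verify_service_template gpu-operator 24.9.2
```

If the lookup fails, listing all templates with `kubectl get servicetemplate -n kcm-system` may help confirm the actual name.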
Deploy service template#
Tested ClusterDeployment in AWS:
apiVersion: k0rdent.mirantis.com/v1beta1
kind: ClusterDeployment
metadata:
  name: aws-example-cluster
  labels:
    group: demo
spec:
  template: aws-standalone-cp-1-0-14
  credential: aws-credential
  config:
    controlPlane:
      instanceType: t3.small
    controlPlaneNumber: 1
    publicIP: false
    region: eu-central-1
    worker:
      # Small AWS instance with NVIDIA GPU
      instanceType: g4dn.xlarge
      # AMI Catalog - Community AMIs:
      # Find region specific AMI ID with title:
      # "ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server"
      # eu-central-1: ami-0162f0739222cca1c, us-east-2: ami-00eb69d236edcfaf8, ap-south-1: ami-0b738b0c888af81f7
      amiID: "ami-0162f0739222cca1c"
      imageLookup: {org: "", format: "", baseOS: ""}
      rootVolumeSize: 100
    workersNumber: 1
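The manifest comments describe finding a region-specific Ubuntu 22.04 AMI by its title in the AMI catalog. As an alternative, the lookup could be sketched with the AWS CLI as below. This assumes the AWS CLI is installed and configured, and that the images are the ones published by Canonical (owner ID `099720109477`). The helper names are hypothetical, and the newest image returned will likely differ from the tested AMI IDs listed in the manifest.

```shell
# Canonical's AWS account ID, which publishes the official Ubuntu AMIs.
CANONICAL_OWNER_ID="099720109477"

# Build the name filter for Ubuntu 22.04 (jammy) amd64 server images.
ubuntu_jammy_filter() {
  printf 'Name=name,Values=%s*\n' "ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server"
}

# Print the ID of the newest matching AMI in the given region ($1).
find_ubuntu_ami() {
  aws ec2 describe-images \
    --region "$1" \
    --owners "$CANONICAL_OWNER_ID" \
    --filters "$(ubuntu_jammy_filter)" \
    --query 'sort_by(Images, &CreationDate)[-1].ImageId' \
    --output text
}

# Usage:
#   find_ubuntu_ami eu-central-1
```

The returned ID would go into `worker.amiID`. Only the AMI IDs listed in the manifest were tested with this guide.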
Tested ClusterDeployment in Azure:
apiVersion: k0rdent.mirantis.com/v1beta1
kind: ClusterDeployment
metadata:
  name: azure-example-cluster
  labels:
    group: demo
spec:
  template: azure-standalone-cp-1-0-15
  credential: azure-credential
  config:
    controlPlaneNumber: 1
    workersNumber: 1
    location: "westus"
    subscriptionID: SUBSCRIPTION_ID_SUBSCRIPTION_ID # Enter the Subscription ID used earlier
    controlPlane:
      vmSize: Standard_A4_v2
    worker:
      image:
        marketplace:
          publisher: "Canonical"
          offer: "0001-com-ubuntu-minimal-jammy"
          sku: "minimal-22_04-lts"
          version: "22.04.202502270"
      rootVolumeSize: 32
      # Small Azure instance with NVIDIA GPU
      vmSize: Standard_NC4as_T4_v3
Operator deployment:
apiVersion: k0rdent.mirantis.com/v1beta1
kind: MultiClusterService
metadata:
  name: gpu-operator
spec:
  clusterSelector:
    matchLabels:
      group: demo
  serviceSpec:
    services:
    - template: gpu-operator-24-9-2
      name: gpu-operator
      namespace: gpu-operator
      values: |
        gpu-operator:
          operator:
            defaultRuntime: containerd
          toolkit:
            env:
            - name: CONTAINERD_CONFIG
              value: /etc/k0s/containerd.d/nvidia.toml
            - name: CONTAINERD_SOCKET
              value: /run/k0s/containerd.sock
            - name: CONTAINERD_RUNTIME_CLASS
              value: nvidia
          # optionally, create DCGM-Exporter ServiceMonitor
          dcgmExporter:
            serviceMonitor:
              enabled: true
              interval: 15s
              honorLabels: true
              additionalLabels: {}
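Once the MultiClusterService has been applied, a quick check on the child cluster could look like the sketch below. It assumes `kubectl` is pointed at the child cluster's kubeconfig and that the operator runs in the `gpu-operator` namespace set above. The helper names and the `gpu-smoke-test` pod are hypothetical. The `runtimeClassName: nvidia` line mirrors the `CONTAINERD_RUNTIME_CLASS` value and may be unnecessary if `nvidia` is the default runtime on the node.

```shell
# Show operator pods and the allocatable GPU count per node.
check_gpu_operator() {
  kubectl get pods -n gpu-operator
  kubectl get nodes \
    "-o=custom-columns=NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
}

# Print a one-shot Pod manifest that requests one GPU and runs nvidia-smi.
# $1 = a CUDA-capable container image that you have access to.
gpu_smoke_pod() {
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  runtimeClassName: nvidia
  containers:
  - name: nvidia-smi
    image: $1
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
}

# Usage (against the child cluster):
#   check_gpu_operator
#   gpu_smoke_pod <cuda-image> | kubectl apply -f -
#   kubectl logs gpu-smoke-test
```

If the node shows no `nvidia.com/gpu` capacity, the operator pods in the `gpu-operator` namespace are the first place to look.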