NVIDIA GPU Operator

The NVIDIA GPU Operator for Kubernetes automates the deployment and configuration of the software components a cluster needs to use NVIDIA GPUs, such as the driver, the container toolkit, the device plugin, and the DCGM-Exporter for monitoring. This makes it easier to run GPU-accelerated workloads such as AI/ML training and high-performance computing.



Prerequisites

Deploy k0rdent v1.4.0 by following the QuickStart guide.

Install template to k0rdent

helm upgrade --install gpu-operator oci://ghcr.io/k0rdent/catalog/charts/kgst --set "chart=gpu-operator:24.9.2" -n kcm-system

Verify service template

kubectl get servicetemplates -A
# NAMESPACE    NAME                            VALID
# kcm-system   gpu-operator-24-9-2             true

Deploy service template

Tested ClusterDeployment in AWS:

apiVersion: k0rdent.mirantis.com/v1beta1
kind: ClusterDeployment
metadata:
  name: aws-example-cluster
  labels:
    group: demo
spec:
  template: aws-standalone-cp-1-0-14
  credential: aws-credential
  config:
    controlPlane:
      instanceType: t3.small
    controlPlaneNumber: 1
    publicIP: false
    region: eu-central-1
    worker:
      # Small AWS instance with NVIDIA GPU
      instanceType: g4dn.xlarge
      # AMI Catalog - Community AMIs:
      #  Find region specific AMI ID with title:
      #  "ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server"
      #  eu-central-1: ami-0162f0739222cca1c, us-east-2: ami-00eb69d236edcfaf8, ap-south-1: ami-0b738b0c888af81f7
      amiID: "ami-0162f0739222cca1c"
      imageLookup: {org: "", format: "", baseOS: ""}
      rootVolumeSize: 100
    workersNumber: 1

Tested ClusterDeployment in Azure:

apiVersion: k0rdent.mirantis.com/v1beta1
kind: ClusterDeployment
metadata:
  name: azure-example-cluster
  labels:
    group: demo
spec:
  template: azure-standalone-cp-1-0-15
  credential: azure-credential
  config:
    controlPlaneNumber: 1
    workersNumber: 1
    location: "westus"
    subscriptionID: SUBSCRIPTION_ID_SUBSCRIPTION_ID # Placeholder: replace with your Azure subscription ID
    controlPlane:
      vmSize: Standard_A4_v2
    worker:
      image:
        marketplace:
          publisher: "Canonical"
          offer: "0001-com-ubuntu-minimal-jammy"
          sku: "minimal-22_04-lts"
          version: "22.04.202502270"
      rootVolumeSize: 32
      # Small Azure instance with NVIDIA GPU
      vmSize: Standard_NC4as_T4_v3

Operator deployment:

apiVersion: k0rdent.mirantis.com/v1beta1
kind: MultiClusterService
metadata:
  name: gpu-operator
spec:
  clusterSelector:
    matchLabels:
      group: demo
  serviceSpec:
    services:
    - template: gpu-operator-24-9-2
      name: gpu-operator
      namespace: gpu-operator
      values: |
        gpu-operator:
          operator:
            defaultRuntime: containerd
          toolkit:
            env:
              - name: CONTAINERD_CONFIG
                value: /etc/k0s/containerd.d/nvidia.toml
              - name: CONTAINERD_SOCKET
                value: /run/k0s/containerd.sock
              - name: CONTAINERD_RUNTIME_CLASS
                value: nvidia
          # Optional: create a ServiceMonitor for DCGM-Exporter
          # (ServiceMonitor is a Prometheus Operator resource, so its CRDs must be installed)
          dcgmExporter:
            serviceMonitor:
              enabled: true
              interval: 15s
              honorLabels: true
              additionalLabels: {}
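
Once the MultiClusterService has been reconciled, a small GPU workload can help confirm that the child cluster actually schedules pods onto the GPU. The manifest below is a minimal sketch, not part of the tested setup above: the pod name is hypothetical, the sample image is an assumption based on NVIDIA's public CUDA vectorAdd sample, and it assumes that a RuntimeClass named nvidia (matching CONTAINERD_RUNTIME_CLASS in the values above) exists on the cluster.

```yaml
# Hypothetical smoke-test Pod; name and image are illustrative assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  # Run once; do not restart after a successful exit.
  restartPolicy: OnFailure
  # Assumes a RuntimeClass "nvidia" exists, matching CONTAINERD_RUNTIME_CLASS above.
  runtimeClassName: nvidia
  containers:
    - name: cuda-vectoradd
      # Assumed sample image; substitute any CUDA image available to your cluster.
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
      resources:
        limits:
          # Extended resource advertised by the NVIDIA device plugin.
          nvidia.com/gpu: 1
```

Apply it against the child cluster's kubeconfig with kubectl apply -f, then inspect the pod with kubectl get pod cuda-vectoradd and kubectl logs cuda-vectoradd. If the pod stays Pending, the node is probably not yet advertising the nvidia.com/gpu resource, which usually means the operator's components are still starting.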