# AMD GPU Operator
| Chart Name | Version | App version |
|---|---|---|
| amd-gpu-operator | v1.4.1 | v1.4.1 |
k0rdent operates in conjunction with the AMD GPU Operator by providing a higher-level orchestration layer for Kubernetes clusters equipped with AMD GPUs.
While the AMD GPU Operator handles the details within a single cluster (managing GPU drivers, device plugins, and the necessary configuration), k0rdent extends this management across multiple clusters, leveraging tools such as Cluster API to simplify the creation and management of GPU-enabled Kubernetes environments.
This synergy allows for efficient utilization of distributed GPU resources, centralized policy enforcement, and comprehensive observability for AI applications running on AMD hardware.
## Prerequisites
Deploy k0rdent v1.4.0: QuickStart
## Install template to k0rdent
```shell
helm upgrade --install amd-gpu-operator oci://ghcr.io/k0rdent/catalog/charts/kgst \
  --set "chart=amd-gpu-operator:v1.4.1" -n kcm-system
```
## Verify service template
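Once the chart install completes, the service template should be registered in the `kcm-system` namespace. A typical check, assuming the template is published under the name `amd-gpu-operator-v1-4-1`:

```shell
# Confirm the service template exists and reports VALID
kubectl get servicetemplate -n kcm-system amd-gpu-operator-v1-4-1
```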
## Deploy service template
Tested `ClusterDeployment` on Azure:

```yaml
apiVersion: k0rdent.mirantis.com/v1beta1
kind: ClusterDeployment
metadata:
  name: azure-example-cluster
  namespace: kcm-system
  labels:
    group: demo
spec:
  template: azure-standalone-cp-1-0-15
  credential: azure-credential
  config:
    controlPlaneNumber: 1
    workersNumber: 1
    location: "westus"
    subscriptionID: SUBSCRIPTION_ID # Enter the Subscription ID
    controlPlane:
      vmSize: Standard_A4_v2
    worker:
      image:
        marketplace:
          publisher: "microsoft-dsvm"
          offer: "ubuntu-hpc"
          sku: "2204-rocm" # Ubuntu 22.04 with AMD driver pre-installed
          version: "22.04.2025041101"
      rootVolumeSize: 500 # Large disk needed for test images
      # 8x AMD Instinct MI300X GPUs (expensive!)
      vmSize: Standard_ND96is_MI300X_v5
```
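Assuming the manifest above is saved as `cluster.yaml` (a hypothetical filename), it can be applied to the management cluster and its progress watched:

```shell
kubectl apply -f cluster.yaml
# Watch provisioning progress; READY becomes True once the child cluster is up
kubectl get clusterdeployment -n kcm-system -w
```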
Operator deployment:
```yaml
apiVersion: k0rdent.mirantis.com/v1beta1
kind: MultiClusterService
metadata:
  name: amd-gpu
spec:
  clusterSelector:
    matchLabels:
      group: demo
  serviceSpec:
    services:
      - template: cert-manager-1-18-2
        name: cert-manager
        namespace: cert-manager
        values: |
          cert-manager:
            crds:
              enabled: true
      - template: amd-gpu-operator-v1-4-1
        name: amd-gpu-operator
        namespace: kube-amd-gpu
```
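Assuming the manifest is saved as `mcs.yaml` (a hypothetical filename), applying it rolls the services out to every cluster labelled `group: demo`:

```shell
kubectl apply -f mcs.yaml
# Check rollout status of the services across matched clusters
kubectl get multiclusterservice amd-gpu
```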
Manually create the AMD GPU `DeviceConfig` custom resource in the child cluster:
```yaml
apiVersion: amd.com/v1beta1
kind: DeviceConfig
metadata:
  name: amd-gpu-operator
  namespace: kube-amd-gpu
spec:
  driver:
    enable: false
  devicePlugin:
    devicePluginImage: rocm/k8s-device-plugin:latest
    nodeLabellerImage: rocm/k8s-device-plugin:labeller-latest
    enableNodeLabeller: true
  selector:
    feature.node.kubernetes.io/amd-vgpu: "true"
```
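Once the device plugin and node labeller are running, nodes matched by the selector above should advertise `amd.com/gpu` as an allocatable resource. A quick sanity check against the child cluster:

```shell
# Nodes matched by the DeviceConfig selector
kubectl get nodes -l feature.node.kubernetes.io/amd-vgpu=true
# Inspect allocatable GPU resources on those nodes
kubectl describe nodes -l feature.node.kubernetes.io/amd-vgpu=true | grep -A 8 Allocatable
```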
Verification: manually create a test pod in the child cluster:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: amd-smi
spec:
  containers:
    - image: docker.io/rocm/pytorch:latest # ~20 GB image!
      name: amd-smi
      command: ["/bin/bash"]
      args: ["-c", "amd-smi version && amd-smi monitor -ptum"]
      resources:
        limits:
          amd.com/gpu: 1
        requests:
          amd.com/gpu: 1
  restartPolicy: Never
```
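Assuming the pod manifest is saved as `amd-smi-pod.yaml` (a hypothetical filename), create it and follow its log; expect a long wait on first run while the large image is pulled:

```shell
kubectl apply -f amd-smi-pod.yaml
# Follow output once the container starts
kubectl logs -f amd-smi
```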
Expected test pod output (log):