# AMD GPU Operator
| Chart Name | Version | App version |
|---|---|---|
| amd-gpu-operator | v1.4.1 | v1.4.1 |
k0rdent operates in conjunction with the AMD GPU Operator by providing a higher-level orchestration layer for Kubernetes clusters equipped with AMD GPUs.
While the AMD GPU Operator handles the details within a single cluster (managing GPU drivers, device plugins, and the necessary configuration), k0rdent extends this management across multiple clusters, leveraging tools such as Cluster API to simplify the creation and management of GPU-enabled Kubernetes environments.
This synergy allows for efficient utilization of distributed GPU resources, centralized policy enforcement, and comprehensive observability for AI applications running on AMD hardware.
## Prerequisites
Deploy k0rdent v1.4.0: QuickStart
## Install template to k0rdent
```shell
helm upgrade --install amd-gpu-operator oci://ghcr.io/k0rdent/catalog/charts/kgst \
  --set "chart=amd-gpu-operator:v1.4.1" -n kcm-system
```
## Verify service template
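Once the chart install completes, the service template should be registered in the `kcm-system` namespace. A typical check, assuming the template is published under the name `amd-gpu-operator-v1-4-1`:

```shell
# Confirm the service template exists and reports VALID
kubectl get servicetemplate -n kcm-system amd-gpu-operator-v1-4-1
```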
## Deploy service template
Tested `ClusterDeployment` on Azure:

```yaml
apiVersion: k0rdent.mirantis.com/v1beta1
kind: ClusterDeployment
metadata:
  name: azure-example-cluster
  namespace: kcm-system
  labels:
    group: demo
spec:
  template: azure-standalone-cp-1-0-15
  credential: azure-credential
  config:
    controlPlaneNumber: 1
    workersNumber: 1
    location: "westus"
    subscriptionID: SUBSCRIPTION_ID # Enter the Subscription ID
    controlPlane:
      vmSize: Standard_A4_v2
    worker:
      image:
        marketplace:
          publisher: "microsoft-dsvm"
          offer: "ubuntu-hpc"
          sku: "2204-rocm" # Ubuntu 22.04 with AMD driver pre-installed
          version: "22.04.2025041101"
      rootVolumeSize: 500 # Large disk needed for test images
      # 8x AMD Instinct MI300X GPUs (expensive!)
      vmSize: Standard_ND96is_MI300X_v5
```
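Assuming the manifest above is saved as `cluster.yaml` (a hypothetical filename), it can be applied to the management cluster and its progress watched:

```shell
kubectl apply -f cluster.yaml
# Watch provisioning progress; READY becomes True once the child cluster is up
kubectl get clusterdeployment -n kcm-system -w
```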
Operator deployment:
```yaml
apiVersion: k0rdent.mirantis.com/v1beta1
kind: MultiClusterService
metadata:
  name: amd-gpu
spec:
  clusterSelector:
    matchLabels:
      group: demo
  serviceSpec:
    services:
      - template: cert-manager-1-18-2
        name: cert-manager
        namespace: cert-manager
        values: |
          cert-manager:
            crds:
              enabled: true
      - template: amd-gpu-operator-v1-4-1
        name: amd-gpu-operator
        namespace: kube-amd-gpu
```
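Assuming the manifest is saved as `mcs.yaml` (a hypothetical filename), applying it rolls the services out to every cluster labelled `group: demo`:

```shell
kubectl apply -f mcs.yaml
# Check rollout status of the services across matched clusters
kubectl get multiclusterservice amd-gpu
```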
Manually create the AMD GPU `DeviceConfig` custom resource in the child cluster:
```yaml
apiVersion: amd.com/v1beta1
kind: DeviceConfig
metadata:
  name: amd-gpu-operator
  namespace: kube-amd-gpu
spec:
  driver:
    enable: false
  devicePlugin:
    devicePluginImage: rocm/k8s-device-plugin:latest
    nodeLabellerImage: rocm/k8s-device-plugin:labeller-latest
    enableNodeLabeller: true
  selector:
    feature.node.kubernetes.io/amd-vgpu: "true"
```
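Once the device plugin and node labeller are running, nodes matched by the selector above should advertise `amd.com/gpu` as an allocatable resource. A quick sanity check against the child cluster:

```shell
# Nodes matched by the DeviceConfig selector
kubectl get nodes -l feature.node.kubernetes.io/amd-vgpu=true
# Inspect allocatable GPU resources on those nodes
kubectl describe nodes -l feature.node.kubernetes.io/amd-vgpu=true | grep -A 8 Allocatable
```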
Verification: manually create a test pod in the child cluster:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: amd-smi
spec:
  containers:
    - image: docker.io/rocm/pytorch:latest # ~20 GB image!
      name: amd-smi
      command: ["/bin/bash"]
      args: ["-c", "amd-smi version && amd-smi monitor -ptum"]
      resources:
        limits:
          amd.com/gpu: 1
        requests:
          amd.com/gpu: 1
  restartPolicy: Never
```
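Assuming the pod manifest is saved as `amd-smi-pod.yaml` (a hypothetical filename), create it and follow its log; expect a long wait on first run while the large image is pulled:

```shell
kubectl apply -f amd-smi-pod.yaml
# Follow output once the container starts
kubectl logs -f amd-smi
```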
Expected test pod output (log):