K8s-Native AI Agents with GPU Virtualization: kagent + HAMi

A complete, real-world hands-on guide to running two CNCF sandbox projects (kagent and HAMi) on Nebius AI Cloud, with Nebius Token Factory as the LLM backend. Every command was executed live. Every output is real.

What follows is a first-hand write-up, not boilerplate generated from prompts.

1 > Why This Combination

kagent makes AI agents actual K8s resources: declarative, version-controlled, RBAC-governed, and inspectable with kubectl.

HAMi virtualizes GPU resources at the K8s scheduler level, allowing multiple pods to share a single physical GPU with hard memory and compute isolation.

Running these two together on Nebius AI Cloud creates something genuinely new.

An AI agent system that manages GPU-virtualized infrastructure, all without depending on any closed-source AI provider.

This is not a tutorial assembled from documentation alone.

Every command in this article was run live on my Nebius tarantula-instance VM with a real NVIDIA L40S GPU.

The outputs are real. The errors I hit are documented and explained. The fixes are verified.

2 > Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│                    Nebius AI Cloud                              │
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │           my tarantula-instance (L40S GPU VM)           │    │
│  │                                                         │    │
│  │  ┌──────────────────────────────────────────────────┐   │    │
│  │  │              k3s K8s Cluster                     │   │    │
│  │  │                                                  │   │    │
│  │  │  ┌─────────────────────┐  ┌──────────────────┐   │   │    │
│  │  │  │   kagent namespace  │  │  kube-system ns  │   │   │    │
│  │  │  │                     │  │                  │   │   │    │
│  │  │  │  ┌───────────────┐  │  │  ┌────────────┐  │   │   │    │
│  │  │  │  │ k8s-agent     │  │  │  │ hami-      │  │   │   │    │
│  │  │  │  │ helm-agent    │  │  │  │ scheduler  │  │   │   │    │
│  │  │  │  │ sre-          │  │  │  │            │  │   │   │    │
│  │  │  │  │ orchestrator  │  │  │  │ hami-      │  │   │   │    │
│  │  │  │  │ + 8 more      │  │  │  │ device-    │  │   │   │    │
│  │  │  │  └───────────────┘  │  │  │ plugin     │  │   │   │    │
│  │  │  │                     │  │  └────────────┘  │   │   │    │
│  │  │  │  ┌───────────────┐  │  └──────────────────┘   │   │    │
│  │  │  │  │ ModelConfig   │  │                         │   │    │
│  │  │  │  │ (Nebius TF)   │  │  ┌──────────────────┐   │   │    │
│  │  │  │  └───────────────┘  │  │  Physical L40S   │   │   │    │
│  │  │  │                     │  │  GPU (46GB VRAM) │   │   │    │
│  │  │  │  ┌───────────────┐  │  │  → 10x vGPU      │   │   │    │
│  │  │  │  │ kagent-tools  │  │  │    via HAMi      │   │   │    │
│  │  │  │  │ MCP Server    │  │  └──────────────────┘   │   │    │
│  │  │  │  └───────────────┘  │                         │   │    │
│  │  │  └─────────────────────┘                         │   │    │
│  │  └──────────────────────────────────────────────────┘   │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│             ┌──────────────────────────────────┐                │
│             │    Nebius Token Factory          │                │
│             │    (OpenAI-compatible API)       │                │
│             │    Model: Llama 3.3 70B          │                │
│             │    Endpoint: api.studio.nebius.ai│                │
│             └──────────────────────────────────┘                │
└─────────────────────────────────────────────────────────────────┘

Data flow for every agent interaction:

  1. User sends task via CLI (kagent invoke) or UI (http://89.169.110.178:8080)
  2. kagent controller routes to the appropriate agent pod
  3. Agent pod calls Nebius Token Factory API with the task + system prompt + tool definitions (sketched after this list)
  4. Llama 3.3 70B decides which tool to call (e.g., k8s_get_resources, k8s_describe_resource)
  5. kagent-tools MCP server executes the K8s API call
  6. Result is returned to the LLM for interpretation
  7. LLM generates a human-readable response
  8. Response streams back to the user
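
Step 3 is an ordinary OpenAI-compatible chat completions call. As a rough sketch (the exact payload kagent assembles, including the full system prompt and tool schemas, is internal to the agent runtime, and the tool schema below is a simplified stand-in), the request looks something like this:

curl https://api.studio.nebius.ai/v1/chat/completions \
  -H "Authorization: Bearer $NEBIUS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.3-70B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a Kubernetes expert agent."},
      {"role": "user", "content": "What is the status of this cluster?"}
    ],
    "tools": [{
      "type": "function",
      "function": {
        "name": "k8s_get_resources",
        "description": "List Kubernetes resources (illustrative schema)",
        "parameters": {
          "type": "object",
          "properties": {
            "resource_type": {"type": "string"},
            "all_namespaces": {"type": "string"}
          }
        }
      }
    }]
  }'

The model answers with either a plain message or a tool_calls block; kagent executes the requested tool via the MCP server and feeds the result back, which is steps 4 through 6 above.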

3 > The Two CNCF Projects

kagent

GitHub: https://github.com/kagent-dev/kagent
Doc: https://kagent.dev

kagent introduces a fundamentally different model for AI agent deployment. Rather than treating agents as app-layer concerns, it treats them as infrastructure. Every agent is a K8s Custom Resource. Every tool server is a K8s resource. Every model config is a K8s resource.

HAMi

GitHub: https://github.com/Project-HAMi/HAMi
Doc: https://project-hami.io

HAMi is a K8s-native GPU virtualization solution. Originally known as k8s-vGPU-scheduler, it was accepted into the CNCF sandbox in late 2024.

HAMi works by:

  1. Installing a custom scheduler (hami-scheduler) that intercepts GPU resource requests
  2. Deploying a device plugin (hami-device-plugin) on each GPU node that registers virtual GPU resources with the kubelet
  3. Injecting a libvgpu.so library into GPU workload containers via volume mounts that intercepts CUDA API calls to enforce memory and compute limits

A single physical GPU appears as multiple virtual GPUs to K8s, with hard isolation enforced at the CUDA driver level.

In my test:

  • 1x NVIDIA L40S (46GB VRAM) → 10x virtual GPUs (nvidia.com/gpu: 10)
  • Each virtual GPU can be allocated specific memory limits via annotations
  • HAMi scheduler enforces these limits. Requesting more than available results in Pending pods

4 > My Infrastructure: Nebius AI Cloud

For this article, I used a single VM instance (tarantula-instance) with:

GPU:    1x NVIDIA L40S (46GB GDDR6 VRAM)
CPU:    8 vCPUs (Intel Ice Lake)
RAM:    32GB
Disk:   80GB SSD
OS:     Ubuntu 24.04 LTS for NVIDIA GPUs (CUDA 13)
Region: eu-north1

Nebius Token Factory provides an OpenAI-compatible API endpoint at:

https://api.studio.nebius.ai/v1

Models available at time of testing (selected):

Model ID                                    Type
meta-llama/Llama-3.3-70B-Instruct           Chat/Instruction
meta-llama/Meta-Llama-3.1-8B-Instruct       Chat/Instruction (fast)
Qwen/Qwen3-32B                              Chat/Instruction
Qwen/Qwen3-235B-A22B-Instruct-2507          Large MoE
deepseek-ai/DeepSeek-V3.2-fast              Chat/Code
nvidia/Llama-3_1-Nemotron-Ultra-253B-v1     Large
openai/gpt-oss-120b                         GPT OSS

I used meta-llama/Llama-3.3-70B-Instruct for all kagent tests.

5 > Step 1: Setting Up k3s on Nebius

k3s is a CNCF-certified, lightweight K8s distribution ideal for single-node and edge deployments. It bundles everything into a single binary and starts in seconds. For this test, k3s provides a full K8s cluster on a single VM without the overhead of kubeadm or kind.

Verifying the VM

Before installing anything, I verified the VM's hardware:

ubuntu@nebius-tarantula:~$ uname -a
Linux nebius-tarantula 6.11.0-1016-nvidia #16-Ubuntu SMP PREEMPT_DYNAMIC Sun Sep 21 17:06:33 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

ubuntu@nebius-tarantula:~$ nvidia-smi
Sat Apr 18 06:29:18 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.09             Driver Version: 580.126.09     CUDA Version: 13.0     |
+-----------------------------------------------------------------------------------------+
|   0  NVIDIA L40S                    On  |   00000000:8D:00.0 Off |                    0 |
| N/A   31C    P8             34W /  325W |       0MiB /  46068MiB |      0%      Default |
+-----------------------------------------------------------------------------------------+

ubuntu@nebius-tarantula:~$ free -h
               total        used        free      shared  buff/cache   available
Mem:            31Gi       3.7Gi        25Gi       115Mi       2.2Gi        27Gi

What this tells us:

  • The NVIDIA driver (580.126.09) is pre-installed on the Nebius GPU image
  • The L40S has 46,068 MiB (approximately 45GB) of VRAM available
  • 27GB RAM is available for K8s workloads
  • The kernel is an NVIDIA-optimized build

Installing k3s

curl -sfL https://get.k3s.io | sh -

Output:

[INFO]  Finding release for channel stable
[INFO]  Using v1.34.6+k3s1 as release
[INFO]  Downloading binary https://github.com/k3s-io/k3s/releases/download/v1.34.6%2Bk3s1/k3s
[INFO]  Installing k3s to /usr/local/bin/k3s
[INFO]  Creating /usr/local/bin/kubectl symlink to k3s
[INFO]  Creating /usr/local/bin/crictl symlink to k3s
[INFO]  systemd: Creating service file /etc/systemd/system/k3s.service
[INFO]  systemd: Enabling k3s unit
[INFO]  systemd: Starting k3s

k3s installs as a systemd service and starts automatically. It also creates symlinks for kubectl and crictl.

Configuring kubectl Access

k3s stores its kubeconfig at /etc/rancher/k3s/k3s.yaml, which is owned by root. I copy it to the user's home directory:

mkdir -p ~/.kube
sudo cp /etc/rancher/k3s/k3s.yaml ~/.kube/config
sudo chown ubuntu:ubuntu ~/.kube/config
export KUBECONFIG=~/.kube/config
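
A note on shorthand: throughout this guide I use short kubectl aliases (k, kgp, kgn). They are not provided by k3s itself; if you want to copy commands verbatim, I assume expansions along these lines:

alias k='kubectl'
alias kgp='kubectl get pods'
alias kgn='kubectl get nodes'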

Verifying

ubuntu@nebius-tarantula:~$ kgn
NAME               STATUS   ROLES           AGE   VERSION
nebius-tarantula   Ready    control-plane   48s   v1.34.6+k3s1

Installing Helm

curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

Output:

Downloading https://get.helm.sh/helm-v3.20.2-linux-amd64.tar.gz
helm installed into /usr/local/bin/helm

Helm is the package manager for K8s; I'll use it to install both kagent and HAMi.

6 > Step 2: Installing kagent with Helm

Creating the kagent Namespace

k create namespace kagent

Understanding the Two-Chart Installation

kagent uses a split Helm chart strategy:

  1. kagent-crds: Installs only the Custom Resource Definitions. CRDs must be installed before the main chart because the main chart creates instances of those CRDs.
  2. kagent: Installs the actual kagent components (controller, engine, UI, agents, tool servers).

This separation is a Helm best practice: if you upgrade kagent, the CRDs can be updated independently without risk of accidentally deleting custom resources.
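
In practice that means a later CRD upgrade can be rolled out on its own, without touching the running components; something like the following (same chart reference as the install below, version left to Helm's default resolution) is enough:

helm upgrade kagent-crds \
  oci://ghcr.io/kagent-dev/kagent/helm/kagent-crds \
  --namespace kagent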

Installing kagent CRDs

helm install kagent-crds \
  oci://ghcr.io/kagent-dev/kagent/helm/kagent-crds \
  --namespace kagent

Output:

Pulled: ghcr.io/kagent-dev/kagent/helm/kagent-crds:0.8.6
Digest: sha256:768aec3722e15470af7d976c769a465dc164dbef9ccca6eb1912a7dfb2747db1
NAME: kagent-crds
STATUS: deployed
REVISION: 1

Installing kagent with Nebius Token Factory

This is where my setup differs from every other kagent tutorial. Instead of providing an OpenAI API key, I provide:

  1. A Nebius Token Factory API key
  2. The Nebius Token Factory base URL
  3. A Llama model available on Nebius

export NEBIUS_API_KEY="v1.fdfdfdfdfdfdfdfdfdf..."  # Your Nebius Token Factory API key

helm install kagent \
  oci://ghcr.io/kagent-dev/kagent/helm/kagent \
  --namespace kagent \
  --set providers.default=openAI \
  --set providers.openAI.apiKey=$NEBIUS_API_KEY \
  --set providers.openAI.endpoint=https://api.studio.nebius.ai/v1 \
  --set providers.openAI.model=meta-llama/Llama-3.3-70B-Instruct

Why providers.default=openAI? kagent uses openAI as the provider type because Nebius Token Factory implements the OpenAI-compatible API specification. The provider type determines which API client code to use. Since Nebius implements the same /v1/chat/completions endpoint as OpenAI, the openAI provider works perfectly.

Output (abbreviated):

Pulled: ghcr.io/kagent-dev/kagent/helm/kagent:0.8.6
NAME: kagent
STATUS: deployed
REVISION: 1

DEPLOYED COMPONENTS:
  - Controller: kagent-controller
  - Engine: kagent-engine (AI agent runtime)
  - UI: kagent-ui (web interface)
  - KMCP Controller: kagent-kmcp-manager-controller

ENABLED AGENTS:
  - argo-rollouts-agent
  - cilium-debug-agent
  - cilium-manager-agent
  - cilium-policy-agent
  - helm-agent
  - istio-agent
  - k8s-agent
  - kgateway-agent
  - observability-agent
  - promql-agent

10 agents are deployed by default. Each is a specialized AI agent with its own system prompt, tool set, and deployment. This is one of kagent's strongest features: a prod-ready agent catalog out of the box.

Fixing the ModelConfig Endpoint

After installation, I discovered that the providers.openAI.endpoint Helm value doesn't map to the openAI.baseUrl field in the ModelConfig CRD. I needed to patch it directly:

# First, verify the API key is in the secret
k create secret generic kagent-openai \
  --from-literal=OPENAI_API_KEY=$NEBIUS_API_KEY \
  -n kagent --dry-run=client -o yaml | k apply -f -

# Then patch the ModelConfig to use Nebius endpoint
k patch modelconfig default-model-config --type=merge -p '{
  "spec": {
    "openAI": {
      "baseUrl": "https://api.studio.nebius.ai/v1"
    },
    "model": "meta-llama/Llama-3.3-70B-Instruct"
  }
}'

The Helm --set providers.openAI.endpoint flag creates the secret correctly but doesn't configure the baseUrl in the ModelConfig. Always verify with k get modelconfig default-model-config -o yaml after installation.

Verifying Pod Status

ubuntu@nebius-tarantula:~$ kgp -n kagent
NAME                                              READY   STATUS    RESTARTS   AGE
argo-rollouts-conversion-agent-7b77468d59-x2lb7   1/1     Running   0          2m18s
cilium-debug-agent-5b786d455d-jq29s               1/1     Running   0          2m18s
cilium-manager-agent-ff5cd9578-pxwtf              1/1     Running   0          2m18s
cilium-policy-agent-8448976f56-jvz9q              1/1     Running   0          2m18s
helm-agent-57479bd6c6-jvfm2                       1/1     Running   0          2m18s
istio-agent-5d5bfc8ff-7fbdz                       1/1     Running   0          2m18s
k8s-agent-f86f9f96f-7lv7r                         1/1     Running   0          2m18s
kagent-controller-7f8c569d89-p6br9                1/1     Running   0          3m30s
kagent-grafana-mcp-696d674dff-2vlvf               1/1     Running   0          3m30s
kagent-kmcp-controller-manager-777746db7c-xxkpg   1/1     Running   0          3m30s
kagent-postgresql-5d75cf4d86-lnd5f                1/1     Running   0          3m30s
kagent-querydoc-74d9f7bbdc-gk8lg                  1/1     Running   0          3m30s
kagent-tools-77c6f575c8-vd2s7                     1/1     Running   0          3m30s
kagent-ui-7fcd65ff6d-hcdph                        1/1     Running   0          3m30s
kgateway-agent-6674fd5f79-8knkx                   1/1     Running   0          2m17s
observability-agent-79f8bfb465-8sn4p              1/1     Running   0          2m18s
promql-agent-68bb8c45d6-7hcwt                     1/1     Running   0          2m17s

17 pods, all Running:

  • 10 agent pods (one per agent type)
  • kagent-controller (the main K8s controller)
  • kagent-postgresql (session/conversation storage)
  • kagent-tools (the MCP tool server)
  • kagent-ui (the web dashboard)
  • kagent-grafana-mcp (Grafana MCP server)
  • kagent-querydoc (doc query MCP server)
  • kagent-kmcp-controller-manager (MCP server lifecycle manager)

7 > Step 3: Connecting to Nebius Token Factory

Getting an API Key

  1. Log in to studio.nebius.ai or tokenfactory.nebius.com
  2. Navigate to API Keys in the left sidebar
  3. Click Create API Key
  4. Set an expiration date (I used 5 years for this test)
  5. Copy the key securely. Treat it like a password.

The key follows a Nebius-specific format but is used identically to an OpenAI API key in API calls.

Finding Available Models

To see which models are available on your Nebius Token Factory account:

curl https://api.studio.nebius.ai/v1/models \
  -H "Authorization: Bearer $NEBIUS_API_KEY" | python3 -m json.tool | grep '"id"'

Output (selected):

"id": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"id": "nvidia/Llama-3_1-Nemotron-Ultra-253B-v1",
"id": "meta-llama/Llama-3.3-70B-Instruct",
"id": "Qwen/Qwen3-235B-A22B-Instruct-2507",
"id": "Qwen/Qwen3-32B",
"id": "deepseek-ai/DeepSeek-V3.2-fast",
"id": "openai/gpt-oss-120b",
"id": "MiniMaxAI/MiniMax-M2.5-fast",

Why I chose meta-llama/Llama-3.3-70B-Instruct: This model has strong instruction-following capabilities and handles structured tool-use (function calling) reliably. For K8s operations where the agent must decide which kubectl-equivalent tool to call with specific parameters, a 70B model provides the reasoning depth needed.

Verifying the ModelConfig

After patching, the ModelConfig should look like this:

k get modelconfig default-model-config -o yaml
apiVersion: kagent.dev/v1alpha2
kind: ModelConfig
metadata:
  name: default-model-config
  namespace: kagent
spec:
  apiKeySecret: kagent-openai
  apiKeySecretKey: OPENAI_API_KEY
  model: meta-llama/Llama-3.3-70B-Instruct
  openAI:
    baseUrl: https://api.studio.nebius.ai/v1
  provider: OpenAI
status:
  conditions:
  - message: Model config accepted
    reason: ModelConfigReconciled
    status: "True"
    type: Accepted

The provider: OpenAI field doesn't mean I'm using OpenAI. It means I'm using the OpenAI-compatible API protocol. The actual requests go to https://api.studio.nebius.ai/v1 with a Nebius API key. This is the power of the OpenAI-compatible API standard. Any provider implementing it can be swapped in.

8 > Step 4: Understanding kagent CRDs

Listing All kagent CRDs

k get crds | grep kagent
agents.kagent.dev                              2026-04-18T06:32:21Z
mcpservers.kagent.dev                          2026-04-18T06:32:21Z
memories.kagent.dev                            2026-04-18T06:32:21Z
modelconfigs.kagent.dev                        2026-04-18T06:32:21Z
modelproviderconfigs.kagent.dev                2026-04-18T06:32:21Z
remotemcpservers.kagent.dev                    2026-04-18T06:32:21Z
toolservers.kagent.dev                         2026-04-18T06:32:21Z

Seven CRDs, each covering a different concern in the kagent architecture. Let's look at the two most instructive ones: Agent and RemoteMCPServer.

The Agent CRD

An Agent resource defines a complete AI agent. Let's look at the k8s-agent:

k describe agent k8s-agent

The key fields are:

  • spec.declarative.modelConfig: References the ModelConfig to use (default-model-config)
  • spec.declarative.systemMessage: The full system prompt that defines agent behavior
  • spec.declarative.tools: List of MCP tools the agent can access
  • spec.declarative.runtime: python or go (the agent runtime engine)
  • spec.declarative.a2aConfig.skills: Agent capabilities advertised via the A2A protocol
  • status.conditions: Current state - Accepted and Ready

The k8s-agent's system prompt defines it as "KubeAssist" with access to 19 tools covering everything from k8s_get_resources to k8s_apply_manifest to k8s_execute_command.
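
If you want to read that system prompt yourself rather than trusting the describe output, it can be pulled straight from the Agent resource using the field path listed above:

k get agent k8s-agent -n kagent -o jsonpath='{.spec.declarative.systemMessage}'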

The RemoteMCPServer CRD

k get remotemcpservers
NAME                   PROTOCOL          URL                                         ACCEPTED
kagent-grafana-mcp     STREAMABLE_HTTP   http://kagent-grafana-mcp.kagent:8000/mcp   True
kagent-tool-server     STREAMABLE_HTTP   http://kagent-tools.kagent:8084/mcp         True

Two MCP servers are registered:

  • kagent-tool-server: The main K8s tool server with 175 tools across K8s, Argo, Cilium, Helm, Istio, and more
  • kagent-grafana-mcp: A Grafana integration server for metrics and dashboards

By using MCP, kagent tools can be consumed by any MCP-compatible AI system, not just kagent.

Viewing All Resources at Once

k get agents,modelconfigs,toolservers,memories,remotemcpservers
NAME                                              TYPE          RUNTIME   READY   ACCEPTED
agent.kagent.dev/argo-rollouts-conversion-agent   Declarative   python    True    True
agent.kagent.dev/cilium-debug-agent               Declarative   python    True    True
agent.kagent.dev/cilium-manager-agent             Declarative   python    True    True
agent.kagent.dev/cilium-policy-agent              Declarative   python    True    True
agent.kagent.dev/helm-agent                       Declarative   python    True    True
agent.kagent.dev/istio-agent                      Declarative   python    True    True
agent.kagent.dev/k8s-agent                        Declarative   python    True    True
agent.kagent.dev/kgateway-agent                   Declarative   python    True    True
agent.kagent.dev/observability-agent              Declarative   python    True    True
agent.kagent.dev/promql-agent                     Declarative   python    True    True

NAME                                          PROVIDER   MODEL
modelconfig.kagent.dev/default-model-config   OpenAI     meta-llama/Llama-3.3-70B-Instruct

NAME                                            PROTOCOL          URL                                         ACCEPTED
remotemcpserver.kagent.dev/kagent-grafana-mcp   STREAMABLE_HTTP   http://kagent-grafana-mcp.kagent:8000/mcp   True
remotemcpserver.kagent.dev/kagent-tool-server   STREAMABLE_HTTP   http://kagent-tools.kagent:8084/mcp         True

All 10 agents are READY=True and ACCEPTED=True. The controller has reconciled every resource to its desired state.

9 > Step 5: Installing HAMi GPU Virtualization

Before installing HAMi, I verified that the NVIDIA Container Toolkit was working correctly:

nvidia-container-cli --version
cli-version: 1.18.2
lib-version: 1.18.2
build date: 2026-01-21T10:48+00:00

nvidia-container-cli info
NVRM version:   580.126.09
CUDA version:   13.0
Device Index:   0
Device Minor:   0
Model:          NVIDIA L40S
Brand:          Nvidia
GPU UUID:       GPU-fdfdf-fdfd-4545-4545-gfgfg
Bus Location:   00000000:8d:00.0
Architecture:   8.9

The NVIDIA container runtime is installed and can communicate with the GPU driver. The GPU UUID (GPU-fdfdf-fdfd-4545-4545-gfgfg) will appear throughout HAMi's allocation logs, providing a consistent identifier for the physical GPU.

Understanding k3s's containerd Config

k3s uses its own embedded containerd instance, not the system containerd. This is crucial for HAMi integration. k3s's containerd config lives at:

sudo cat /var/lib/rancher/k3s/agent/etc/containerd/config.toml
# File generated by k3s. DO NOT EDIT. Use config-v3.toml.tmpl instead.
version = 3
imports = ["/var/lib/rancher/k3s/agent/etc/containerd/config-v3.toml.d/*.toml"]
root = "/var/lib/rancher/k3s/agent/containerd"
state = "/run/k3s/containerd"
[grpc]
  address = "/run/k3s/containerd/containerd.sock"

Two observations:

  1. k3s uses version 3 containerd config (newer format with plugins.'io.containerd.cri.v1.runtime')
  2. k3s uses a different containerd socket (/run/k3s/containerd/containerd.sock) than system containerd

The config also already has an nvidia runtime section:

[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.'nvidia']
  runtime_type = "io.containerd.runc.v2"
[plugins.'io.containerd.cri.v1.runtime'.containerd.runtimes.'nvidia'.options]
  BinaryName = "/usr/bin/nvidia-container-runtime"
  SystemdCgroup = true

This means k3s already knows about the nvidia runtime. I just need to ensure HAMi can use it.

The Critical Fix: Removing the Conflicting Config

The default nvidia-ctk runtime configure --runtime=containerd command creates a config file at /etc/containerd/conf.d/99-nvidia.toml using the version 1 containerd format. This conflicts with k3s's version 3 format.

The fix is to remove this conflicting config:

sudo rm /etc/containerd/conf.d/99-nvidia.toml
sudo systemctl restart containerd
sudo systemctl restart k3s

This allows k3s to use its own built-in nvidia runtime config without interference.
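
A quick sanity check after the restart is to confirm that k3s regenerated its own containerd config and that the nvidia runtime entry is still present (a simple grep over the k3s containerd config tree shown above):

sudo grep -r -A 3 "nvidia" /var/lib/rancher/k3s/agent/etc/containerd/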

Adding Helm Repo and Installing HAMi

helm repo add hami-charts https://project-hami.github.io/HAMi/
helm repo update
k label node nebius-tarantula gpu=on

Why label the node? HAMi's device plugin DaemonSet uses a node selector for gpu=on. Without this label, the DaemonSet won't schedule on the node and HAMi won't see the GPU.

helm install hami hami-charts/hami \
  --set scheduler.kubeScheduler.imageTag=v1.34.6 \
  --set devicePlugin.nvidiaDriverRoot=/usr \
  -n kube-system

  • scheduler.kubeScheduler.imageTag=v1.34.6: Must match the k3s K8s version exactly. HAMi's scheduler extends the default K8s scheduler, so version compatibility is critical.
  • devicePlugin.nvidiaDriverRoot=/usr: Tells HAMi where to find the NVIDIA libraries on the host. On Ubuntu with standard NVIDIA drivers, libraries are in /usr/lib and binaries in /usr/bin, so /usr is the correct root.

Verifying HAMi Installation

kgp -n kube-system | grep hami
hami-device-plugin-cw6t5          2/2     Running   0   2m58s
hami-scheduler-84f8888fc5-fqrp8   2/2     Running   0   3m40s

  • hami-device-plugin: Contains device-plugin (registers vGPU resources) and vgpu-monitor (metrics collection)
  • hami-scheduler: Contains the HAMi scheduler extender and a webhook for pod mutation

10 > Step 6: HAMi + containerd Integration

Understanding exactly how HAMi integrates with containerd and K8s is key to troubleshooting and prod deployments.

How HAMi Allocates GPUs

When a pod requests nvidia.com/gpu: 1, the following happens:

  1. Admission webhook intercepts the pod creation request and sets schedulerName: hami-scheduler
  2. hami-scheduler evaluates available virtual GPU resources across nodes
  3. HAMi adds allocation annotations to the pod:
    hami.io/bind-phase: success
    hami.io/vgpu-devices-allocated: GPU-745c1c2c-56d7-168a-ca56-97e3dd8e0db7,NVIDIA,46068,100:;
    hami.io/vgpu-node: nebius-tarantula
    
  4. hami-device-plugin receives the Allocate call from kubelet and injects:
    • NVIDIA_VISIBLE_DEVICES env variable (the physical GPU UUID)
    • CUDA_DEVICE_MEMORY_LIMIT_0 (memory limit in MiB)
    • CUDA_DEVICE_SM_LIMIT (compute core percentage)
    • Volume mounts for libvgpu.so (the CUDA intercept library)
    • /etc/ld.so.preload pointing to libvgpu.so

The libvgpu.so Mechanism

The HAMi device plugin log reveals exactly what gets injected:

Allocate Response [&ContainerAllocateResponse{
  Envs: map[string]string{
    CUDA_DEVICE_MEMORY_LIMIT_0: 46068m,
    CUDA_DEVICE_MEMORY_SHARED_CACHE: /usr/local/vgpu/4fgff375-fgfg-fgfg-fgfg-1e78b8cd030f.cache,
    CUDA_DEVICE_SM_LIMIT: 100,
    NVIDIA_VISIBLE_DEVICES: GPU-745dfdfd-dfdf-dfdf-dfdf-dfdfdfdfdfdf,
  },
  Mounts: []*Mount{
    &Mount{
      ContainerPath: /usr/local/vgpu/libvgpu.so,
      HostPath: /usr/local/vgpu/libvgpu.so.v2.8.1,
      ReadOnly: true,
    },
    &Mount{
      ContainerPath: /etc/ld.so.preload,
      HostPath: /usr/local/vgpu/ld.so.preload,
      ReadOnly: true,
    },
  },
}]

/etc/ld.so.preload causes the Linux dynamic linker to load libvgpu.so before any other library in every process. libvgpu.so intercepts CUDA API calls between the CUDA runtime (libcudart.so) and the CUDA driver (libcuda.so), enforcing memory and compute limits transparently.
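
Once any HAMi-scheduled pod is running (for example the gpu-test-1 pod created later in Step 12), the injection is easy to observe from inside the container; a quick check along these lines shows the preload file, the mounted library, and the injected limits:

k exec -it gpu-test-1 -n default -- sh -c \
  'cat /etc/ld.so.preload; ls -l /usr/local/vgpu/; env | grep ^CUDA_DEVICE'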

11 > Step 7: Verifying GPU Virtualization

Checking Node Allocatable Resources

kgn nebius-tarantula -o jsonpath='{.status.allocatable}' | python3 -m json.tool
{
    "cpu": "8",
    "ephemeral-storage": "77982375671",
    "hugepages-1Gi": "0",
    "hugepages-2Mi": "0",
    "memory": "32865164Ki",
    "nvidia.com/gpu": "10",
    "pods": "110"
}

This is the proof that HAMi is working. Before HAMi, nvidia.com/gpu was absent from the allocatable resources. Now it shows 10. HAMi has virtualized the single physical L40S into 10 virtual GPUs.

The node annotation confirms the GPU registration:

k describe node nebius-tarantula | grep "hami.io/node-nvidia-register" -A1
hami.io/node-nvidia-register:
  [{"id":"GPU-745c1c2c-56d7-168a-ca56-97e3dd8e0db7","count":10,"devmem":46068,"devcore":100,"type":"NVIDIA L40S","mode":"hami-core","health":true}]

Breaking down this JSON:

  • id: The physical GPU UUID
  • count: 10 virtual GPUs registered
  • devmem: 46,068 MiB total VRAM
  • devcore: 100% of GPU compute cores available
  • mode: hami-core, HAMi's software-based CUDA interception mode
  • health: true - GPU is healthy

12 > Step 8: Agent Testing (k8s-agent)

(Screenshot: the k8s-agent chat interface in the kagent UI)

First Successful Test: Cluster Status

# Start port-forward for the controller API
k port-forward svc/kagent-controller 8083:8083 -n kagent --address 0.0.0.0 &

# Invoke the k8s-agent
kagent invoke --agent k8s-agent \
  --task "What is the status of this cluster? List all running pods." \
  --stream

The agent's response stream shows the complete reasoning process:

  1. Tool call decision: The LLM decides to call k8s_get_resources with parameters {"all_namespaces":"true","output":"wide","resource_type":"pods"}
  2. Tool execution: The MCP server executes the equivalent of kgp -A -o wide
  3. All 25 running pods with their IPs, nodes, and status
  4. Synthesis: The LLM summarizes: "The cluster has 25 running pods across different namespaces, including kagent, and kube-system."

Token usage: 6,131 total tokens (4,655 prompt + 1,476 completion)

Health Check Test

kagent invoke --agent k8s-agent \
  --task "Check if there are any pods with high restart counts or in error state. Give me a health summary." \
  --stream

Agent response:

Based on the provided output, there are no pods with high restart counts or in error state.

Health summary:
- Total pods: 24
- Running pods: 24
- Pods with restarts: 0
- Error pods: 0

The cluster appears to be in a good state.

Node GPU Check (Before HAMi)

Before HAMi was installed, I asked the agent about GPU availability:

kagent invoke --agent k8s-agent \
  --task "Describe the node nebius-tarantula in detail. What GPUs are available?" \
  --stream

Agent response (before HAMi):

"The node 'nebius-tarantula' does not have any GPUs available. The output of the describe node command shows no indication of any GPU devices."

This confirms that without HAMi (or the NVIDIA device plugin), K8s has no visibility into the GPU. It's just a CPU node from K8s' perspective.

Node GPU Check (After HAMi)

kagent invoke --agent k8s-agent \
  --task "Check the node nebius-tarantula allocatable resources. How many GPUs are available and what type?" \
  --stream

Agent response (after HAMi):

"The node 'nebius-tarantula' has 10 GPUs available, and they are of type 'NVIDIA L40S'."

The agent correctly reads the HAMi-registered virtual GPU count and identifies the GPU model from the node annotations.

Self-Inspection Test

One of the most interesting tests: asking the agent to describe itself using the K8s API.

kagent invoke --agent k8s-agent \
  --task "Describe yourself: what CRD defines you, what tools do you have, and what is your modelconfig? Use kubectl to inspect your own agent resource." \
  --stream

The agent:

  1. Tried k8s_describe_resource for agent k8s_agent -n default, failed (wrong namespace and name format),
  2. Tried k8s_get_resources for agents. Found all agents with their namespaces,
  3. Tried k8s_describe_resource for agent k8s-agent -n kagent, succeeded,
  4. Read its own CRD definition including system prompt, tool list, and model config,
  5. Synthesized a response explaining its own architecture.

Final response:

"The K8s AI Agent System Prompt is defined by the CustomResourceDefinition (CRD) named 'Agent' in the 'kagent.dev' API group. The agent has access to various tools for diagnosing and solving K8s issues, including informational tools like GetResources, DescribeResource, and GetEvents, as well as modification tools like CreateResource, ApplyManifest, and PatchResource. The agent's model config is set to 'default-model-config'."

This demonstrates genuine meta-awareness: the agent can read and understand its own K8s definition.

Deployment Creation Test

kagent invoke --agent k8s-agent \
  --task "Create a simple nginx deployment with 2 replicas in the default namespace. Name it 'test-nginx'." \
  --stream

The agent generated and applied this manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:latest
        ports:
        - containerPort: 80

The agent deployed to the kagent namespace (its own namespace) rather than default. This is because the agent pod runs in kagent and the tool server's kubectl context defaults to that namespace. In prod, you would configure the tool server's kubeconfig with appropriate namespace defaults.
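
A quick manual check confirms where the deployment actually landed and that both replicas are up:

k get deployment test-nginx -n kagent
k get pods -n kagent -l app=nginx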

Deployment Deletion Test

kagent invoke --agent k8s-agent \
  --task "Delete the deployment named test-nginx in the kagent namespace." \
  --stream

The complete create → verify → delete cycle demonstrates that kagent agents can manage the full K8s resource lifecycle through natural language instructions.

13 > Step 9: Agent Testing (helm-agent)

kagent invoke --agent helm-agent \
  --task "List all helm releases installed in this cluster with their status and namespaces." \
  --stream

The helm-agent first tried a --namespace all flag, which failed (not a valid way to list every namespace with Helm), then retried without the namespace flag:

NAME        NAMESPACE   REVISION   UPDATED                                 STATUS     CHART              APP VERSION
kagent      kagent      1          2026-04-18 07:27:17.040520124 +0000 UTC deployed   kagent-0.8.6
kagent-crds kagent      1          2026-04-18 07:27:12.176735317 +0000 UTC deployed   kagent-crds-0.8.6

The agent automatically retried with different parameters when the first attempt failed. This retry-and-adapt behavior is emergent from the LLM's reasoning. It wasn't explicitly programmed. The agent recognized the error message and adjusted its approach.

14 > Step 10: Custom Agent Creation (sre-orchestrator)

Beyond the built-in agents, kagent allows creating custom agents declaratively. I created an SRE orchestrator agent that can delegate to both the k8s-agent and promql-agent:

apiVersion: kagent.dev/v1alpha2
kind: Agent
metadata:
  name: sre-orchestrator
  namespace: kagent
spec:
  type: Declarative
  description: "SRE Orchestrator that can delegate to k8s-agent and promql-agent"
  declarative:
    modelConfig: default-model-config
    systemMessage: |
      You are an SRE orchestrator. You have access to K8s tools and can 
      delegate to specialized agents. When asked about metrics, use promql-agent.
      When asked about cluster state, use k8s tools directly.
    tools:
    - type: McpServer
      mcpServer:
        name: kagent-tool-server
        kind: RemoteMCPServer
        apiGroup: kagent.dev
        toolNames:
        - k8s_get_resources
        - k8s_describe_resource
        - k8s_get_pod_logs
        - k8s_get_events
    - type: Agent
      agent:
        name: promql-agent

k apply -f sre-orchestrator.yaml

The kagent controller immediately:

  1. Detected the new Agent resource
  2. Created a Deployment for the sre-orchestrator pod
  3. Set up the A2A endpoint at http://sre-orchestrator.kagent:8080
  4. Registered promql-agent as a callable tool

k get agent sre-orchestrator
NAME               TYPE          RUNTIME   READY   ACCEPTED
sre-orchestrator   Declarative   python    True    True

Ready in under 2 minutes.

15 > Step 11: A2A Protocol (Agent-to-Agent Communication)

The Agent-to-Agent protocol is one of kagent's most powerful features. It allows agents to call other agents as tools, enabling complex multi-agent workflows.

How A2A Works

When an agent is listed as a type: Agent tool, kagent:

  1. Exposes an A2A endpoint for that agent, with its agent card published at /.well-known/agent-card.json
  2. Exposes the agent's skills as callable tools
  3. Routes inter-agent calls through the kagent controller

The A2A Test

kagent invoke --agent sre-orchestrator \
  --task "Check the status of all pods in kagent namespace, then ask promql-agent for the CPU usage query for these pods." \
  --stream

Watching the streaming output reveals the multi-agent workflow:

Step 1 > sre-orchestrator calls k8s tool directly:

{"name": "k8s_get_resources", "args": {"namespace": "kagent", "resource_type": "pod"}}

Step 2 > sre-orchestrator delegates to promql-agent:

{"name": "kagent__NS__promql_agent", "args": {"request": "CPU usage query for pods in kagent namespace"}}

Step 3 > promql-agent receives the request in its own session:

{"subagent_session_id": "2ab4bf39-379d-404c-8fda-0e6c9e28c5da"}

What the session ID tells us: The subagent_session_id is different from the parent session ID. Each agent runs in its own isolated session with its own context window. The sre-orchestrator doesn't see promql-agent's internal reasoning, only its final response.

Token accounting:

  • sre-orchestrator used: 2,438 tokens
  • promql-agent sub-session used: 1,869 tokens
  • Total: 4,307 tokens for the multi-agent interaction

This demonstrates true A2A delegation: the orchestrator doesn't need to know how to write PromQL queries. It just delegates to a specialist.

16 > Step 12: kagent Manages HAMi GPU Workloads

This is the centerpiece of the entire test: using kagent AI agents to manage HAMi GPU-virtualized workloads.

(Screenshots: the k8s_get_resources tool call and the HAMi-annotated node details returned by the agent)

Agent-Requested GPU Pod Deployment

kagent invoke --agent k8s-agent \
  --task "Create a pod named 'gpu-test-1' in default namespace that requests 1 nvidia.com/gpu with 20000 GPU memory limit using HAMi annotations (nvidia.com/gpumem: 20000). Use nvidia/cuda:12.3.0-base-ubuntu22.04 image and run sleep infinity." \
  --stream

The agent generated the correct HAMi-annotated manifest:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test-1
  namespace: default
  annotations:
    nvidia.com/gpumem: "20000"
spec:
  containers:
  - name: gpu-test
    image: nvidia/cuda:12.3.0-base-ubuntu22.04
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 1

HAMi immediately picked up the pod and annotated it with allocation details:

hami.io/bind-phase: success
hami.io/vgpu-devices-allocated: GPU-745c1c2c-56d7-168a-ca56-97e3dd8e0db7,NVIDIA,46068,100:;
nvidia.com/gpumem: 20000
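
For scripting or auditing, the same allocation details can be read straight from the pod's annotations:

k describe pod gpu-test-1 -n default | grep -E 'hami.io|gpumem'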

Dual-Pod GPU Sharing

For the sharing test, alongside a 20,000 MiB pod (gpu-test-2) in the kagent namespace, I created a second pod requesting 15,000 MiB of GPU memory:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test-3
  namespace: kagent
  annotations:
    nvidia.com/gpumem: "15000"
spec:
  containers:
  - name: gpu-test-3
    image: ubuntu:22.04
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 1

Both pods were allocated to the same physical GPU (GPU-745c1c2c). HAMi's scheduler assigned them successfully because:

  • gpu-test-2: 20,000 MiB requested
  • gpu-test-3: 15,000 MiB requested
  • Total: 35,000 MiB, well within the 46,068 MiB physical limit

kagent Analyzes the GPU Sharing

kagent invoke --agent k8s-agent \
  --task "Look at pods gpu-test-2 and gpu-test-3 in kagent namespace. Check their HAMi annotations and explain how HAMi is sharing the GPU between these two pods. What GPU UUID is being used?" \
  --stream

Agent response:

"The HAMi annotations are being used to manage the GPU resources between the two pods. The 'hami.io/vgpu-devices-allocated: GPU-745c1c2c-56d7-168a-ca56-97e3dd8e0db7,NVIDIA,46068,100:;' annotation shows that the GPU device with UUID 'GPU-745c1c2c-56d7-168a-ca56-97e3dd8e0db7' was allocated to the pod.

The 'nvidia.com/gpumem: 20000' annotation in the gpu-test-2 pod and the 'nvidia.com/gpumem: 15000' annotation in the gpu-test-3 pod indicate that the GPU memory is being shared between the two pods. In summary, the GPU device with UUID 'GPU-745c1c2c-56d7-168a-ca56-97e3dd8e0db7' is being shared between the two pods."

The agent correctly identified the GPU UUID, understood the memory allocation annotations, and explained the sharing mechanism. All by reading K8s resource YAML through the MCP tool interface.

17 > Step 13: GPU Overcommit Protection

Testing HAMi's Enforcement

I created a pod requesting 11 virtual GPUs, more than the 10 available:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-overcommit-test
  namespace: kagent
  annotations:
    nvidia.com/gpumem: "5000"
spec:
  containers:
  - name: gpu-overcommit
    image: ubuntu:22.04
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 11

k apply -f gpu-overcommit-test.yaml
k describe pod gpu-overcommit-test | tail -10
Events:
  Warning  FailedScheduling  3m10s                hami-scheduler  0/1 nodes are available: 1 NodeUnfitPod. no new claims to deallocate, preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
  Warning  FailedScheduling  28s (x2 over 3m10s)  hami-scheduler  0/1 nodes are available: 1 NodeUnfitPod. no new claims to deallocate, preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
  Warning  FilteringFailed   16s (x4 over 3m10s)  hami-scheduler  no available node, 1 nodes do not meet

The pod remains Pending indefinitely. HAMi's scheduler correctly refuses to schedule a pod that requests more virtual GPUs than are available. The error message NodeUnfitPod and FilteringFailed indicate that the HAMi scheduler's filtering phase rejected all available nodes.

This is precisely how a prod GPU virtualization system should behave: hard enforcement of resource limits, not best-effort.
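
When the experiment is over, the test pods clean up with a couple of deletes (names and namespaces as used above):

k delete pod gpu-overcommit-test gpu-test-2 gpu-test-3 -n kagent
k delete pod gpu-test-1 -n default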

18 > Step 14: HAMi Metrics and Observability

The HAMi Prometheus Metrics Endpoint

HAMi exposes GPU metrics in Prometheus format at the monitoring service:

k get svc -n kube-system | grep hami
hami-device-plugin-monitor   NodePort   10.43.107.36   <none>   31992:31992/TCP   64m
hami-scheduler               NodePort   10.43.242.176  <none>   443:31998/TCP,31993:31993/TCP   64m

curl http://localhost:31992/metrics
# HELP HostCoreUtilization GPU core utilization
# TYPE HostCoreUtilization gauge
HostCoreUtilization{deviceidx="0",devicetype="NVIDIA-NVIDIA L40S",deviceuuid="GPU-745c1c2c-56d7-168a-ca56-97e3dd8e0db7",zone="vGPU"} 0

# HELP HostGPUMemoryUsage GPU device memory usage
# TYPE HostGPUMemoryUsage gauge
HostGPUMemoryUsage{deviceidx="0",devicetype="NVIDIA-NVIDIA L40S",deviceuuid="GPU-745c1c2c-56d7-168a-ca56-97e3dd8e0db7",zone="vGPU"} 6.40090112e+08

# HELP hami_build_info hami build metadata
# TYPE hami_build_info gauge
hami_build_info{build_date="20260417-04:25:02",compiler="gc",go_version="go1.25.5",platform="linux/amd64",revision="713ad31",version="v2.8.1"} 1

Reading these metrics:

  • HostCoreUtilization: Currently 0%, my test pods are running sleep infinity, not using GPU compute
  • HostGPUMemoryUsage: 640,090,112 bytes (~610 MiB), this is the CUDA runtime overhead, not workload memory
  • hami_build_info: Confirms HAMi v2.8.1 is running, built with Go 1.25.5
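
If Prometheus is running in the cluster, a scrape job along these lines (an illustrative config, not something the HAMi chart installs for you) picks these metrics up through the NodePort service:

scrape_configs:
  - job_name: hami-vgpu
    static_configs:
      - targets: ['<node-ip>:31992']   # hami-device-plugin-monitor NodePort; path defaults to /metrics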

kagent Analyzes HAMi Metrics

I pointed the k8s-agent at the HAMi monitoring service and asked it to analyze the available metrics:

kagent invoke --agent k8s-agent \
  --task "The HAMi GPU monitor is running at http://10.43.107.36:31992/metrics. Check the service 'hami-device-plugin-monitor' in kube-system namespace and tell me what GPU metrics are available." \
  --stream

The agent:

  1. Described the hami-device-plugin-monitor service
  2. Checked service connectivity (got 404 at root, correct, metrics are at /metrics)
  3. Listed all hami pods in kube-system
  4. Retrieved device plugin logs and analyzed the GPU allocation events

From the device plugin logs, the agent extracted:

CUDA_DEVICE_MEMORY_LIMIT_0: 46068m
CUDA_DEVICE_MEMORY_SHARED_CACHE: /usr/local/vgpu/[uuid].cache
CUDA_DEVICE_SM_LIMIT: 100
NVIDIA_VISIBLE_DEVICES: GPU-745c1c2c-56d7-168a-ca56-97e3dd8e0db7

These are the env variables that HAMi injects into each GPU-using container. The agent understood and explained their purpose.

19 > Step 15: kagent CLI Deep Dive

The kagent CLI (version 0.9.0-beta5) provides several resource management commands beyond invoke:

Listing Agents

kagent get agent
+----+---------------------------------------+----------------------+------------------+----------+
| #  | NAME                                  | CREATED              | DEPLOYMENT_READY | ACCEPTED |
+----+---------------------------------------+----------------------+------------------+----------+
| 1  | kagent/kgateway-agent                 | 2026-04-18T07:27:18Z | true             | true     |
| 2  | kagent/observability-agent            | 2026-04-18T07:27:18Z | true             | true     |
| 3  | kagent/promql-agent                   | 2026-04-18T07:27:18Z | true             | true     |
| 4  | kagent/sre-orchestrator               | 2026-04-18T07:45:56Z | true             | true     |
| 5  | kagent/cilium-manager-agent           | 2026-04-18T07:27:18Z | true             | true     |
| 6  | kagent/cilium-policy-agent            | 2026-04-18T07:27:18Z | true             | true     |
| 7  | kagent/istio-agent                    | 2026-04-18T07:27:18Z | true             | true     |
| 8  | kagent/k8s-agent                      | 2026-04-18T07:27:18Z | true             | true     |
| 9  | kagent/argo-rollouts-conversion-agent | 2026-04-18T07:27:18Z | true             | true     |
| 10 | kagent/cilium-debug-agent             | 2026-04-18T07:27:18Z | true             | true     |
| 11 | kagent/helm-agent                     | 2026-04-18T07:27:18Z | true             | true     |
+----+---------------------------------------+----------------------+------------------+----------+

Note that sre-orchestrator (row 4) shows a later creation timestamp. This is my custom agent, created during the session.

Listing Sessions

kagent get session
+----+------------------------------------------+-------------------------+------------------------------+----------------------+
| #  | ID                                       | NAME                    | AGENT                        | CREATED              |
+----+------------------------------------------+-------------------------+------------------------------+----------------------+
| 1  | ctx-31dcebe5-...                         | List all running pod... | kagent__NS__k8s_agent        | 2026-04-18T07:34:56Z |
| 2  | 94aca692-...                             | List all helm releas... | kagent__NS__helm_agent       | 2026-04-18T07:39:46Z |
| 3  | 84249c1c-...                             | Create a simple ngin... | kagent__NS__k8s_agent        | 2026-04-18T07:40:50Z |
| 4  | a862ac3e-...                             | Delete the deploymen... | kagent__NS__k8s_agent        | 2026-04-18T07:42:45Z |
| 5  | aaa28834-...                             | What PromQL query wo... | kagent__NS__promql_agent     | 2026-04-18T07:44:18Z |
| 6  | 6e4e0146-...                             | Check the status of ... | kagent__NS__sre_orchestrator | 2026-04-18T07:46:36Z |
| 7  | 2ab4bf39-...                             | CPU usage query for ... | kagent__NS__promql_agent     | 2026-04-18T07:46:40Z |
| 8  | 1368a699-...                             | Describe yourself: w... | kagent__NS__k8s_agent        | 2026-04-18T07:48:03Z |
+----+------------------------------------------+-------------------------+------------------------------+----------------------+

Critical observation about sessions 6 and 7: Session 6 is the sre-orchestrator session. Session 7 is a promql-agent session created at nearly the same time. This is the A2A sub-session. The orchestrator delegated to promql-agent, creating a separate session stored in PostgreSQL. The session history provides a complete audit trail of all agent interactions.

Listing Tools

kagent get tool

This returns all 175 tools organized by category:

  • Argo (8 tools): rollout management, plugin verification
  • Cilium (15 tools): network policy, endpoint management, debugging
  • Helm (12 tools): release management, chart operations
  • Istio (18 tools): service mesh config, traffic management
  • K8s (19 tools): full kubectl-equivalent operations
  • Gateway (12 tools): kgateway/Gateway API management
  • Prometheus (8 tools): PromQL execution, metric queries
  • Grafana (various): dashboard management via MCP

20 > Step 16: A2A Agent Card Protocol

Every kagent agent exposes a standard A2A agent card at /.well-known/agent-card.json. This is how agents discover each other's capabilities.

k8s-agent Card

curl http://10.42.0.68:8080/.well-known/agent-card.json | python3 -m json.tool
{
    "capabilities": {
        "pushNotifications": false,
        "stateTransitionHistory": true,
        "streaming": true
    },
    "defaultInputModes": ["text"],
    "defaultOutputModes": ["text"],
    "description": "An K8s Expert AI Agent specializing in cluster operations, troubleshooting, and maintenance.",
    "name": "k8s_agent",
    "preferredTransport": "JSONRPC",
    "protocolVersion": "0.3.0",
    "skills": [
        {
            "description": "The ability to analyze and diagnose K8s Cluster issues.",
            "examples": [
                "What is the status of my cluster?",
                "How can I troubleshoot a failing pod?",
                "What are the resource limits for my nodes?"
            ],
            "id": "cluster-diagnostics",
            "name": "Cluster Diagnostics",
            "tags": ["cluster", "diagnostics"]
        },
        {
            "id": "resource-management",
            "name": "Resource Management",
            "tags": ["resource", "management"]
        },
        {
            "id": "security-audit",
            "name": "Security Audit",
            "tags": ["security", "audit"]
        }
    ],
    "url": "http://k8s-agent.kagent:8080",
    "version": ""
}

sre-orchestrator Card

curl http://10.42.0.81:8080/.well-known/agent-card.json | python3 -m json.tool
{
    "capabilities": {
        "pushNotifications": false,
        "stateTransitionHistory": true,
        "streaming": true
    },
    "name": "sre_orchestrator",
    "protocolVersion": "0.3.0",
    "skills": [],
    "url": "http://sre-orchestrator.kagent:8080"
}

The sre-orchestrator's "skills" array is empty because I didn't define any a2aConfig.skills in my custom agent YAML. The k8s-agent has 3 skills defined in its Helm chart values. In prod, you'd want to define skills for your custom agents so they stay discoverable; a sketch follows.
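
For reference, skills live on the Agent resource under spec.declarative.a2aConfig.skills (see Step 4). A sketch of what that might look like for the sre-orchestrator, with the field layout inferred from the k8s-agent card above rather than taken from the CRD schema, so verify it against your kagent version:

spec:
  declarative:
    a2aConfig:
      skills:
      - id: incident-triage
        name: Incident Triage
        description: Correlate pod status with metrics to triage incidents.
        tags: ["sre", "triage"]
        examples:
        - "Check the kagent namespace and suggest a CPU usage query."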

The protocolVersion: "0.3.0" confirms both agents use the same A2A protocol version, ensuring compatibility.

21 > Step 17: kagent UI

The kagent UI is accessible at http://<node-ip>:8080 after port-forwarding:

k port-forward svc/kagent-ui 8080:8080 -n kagent --address 0.0.0.0 &

22 > Step 18: Memory CRD and Limitations

The Memory CRD Schema

k get crd memories.kagent.dev -o yaml | grep -A 10 "provider:"
provider:
  default: Pinecone
  description: The provider of the memory
  enum:
  - Pinecone
  type: string

The Memory CRD in kagent v0.8.6 supports only Pinecone as the memory provider. Pinecone is a managed vector database service.

This means long-term agent memory requires:

  1. A Pinecone account and API key
  2. A Pinecone index configured for the agent's embeddings

In my test env, I didn't have Pinecone configured, so the Memory CRD returned no resources. When an agent says "I will remember that...", it's only remembering within the current session (stored in PostgreSQL). Cross-session memory requires Pinecone.

Session-Level Memory via PostgreSQL

Conversations are persisted in the bundled PostgreSQL instance. The kagent get session command shows all historical sessions. Within a session, the full conversation history is available to the agent (subject to context window limits).

kagent supports context compaction. When enabled, older messages in a long conversation are summarized to reduce token usage while preserving key information:

spec:
  declarative:
    context:
      compaction:
        compactionInterval: 5  # Summarize every 5 turns

23 > Honest Assessment

What Works Well

kagent:

  • The CRD-based architecture is genuinely powerful for prod deployments
  • 10 prod-ready agents out of the box
  • The A2A protocol works reliably for multi-agent delegation
  • The kagent CLI provides a clean interface for both management and invocation
  • The UI is polished and functional
  • Session persistence in PostgreSQL gives full audit capability
  • Connecting to Nebius Token Factory (or any OpenAI-compatible API) works perfectly

HAMi:

  • GPU virtualization works as advertised: 1 L40S → 10 virtual GPUs
  • The scheduler correctly enforces limits (overcommit protection)
  • Prometheus metrics are available and informative
  • The device plugin log provides detailed allocation tracing
  • Integration with k3s works after the containerd config conflict is resolved

The combination:

  • kagent agents can read, understand, and explain HAMi GPU allocations
  • Agent-driven GPU workload deployment with HAMi annotations works
  • The full stack requires zero external AI provider dependencies

Known Limitations

kagent v0.8.6:

  • Memory CRD only supports Pinecone (no built-in vector DB option)
  • kmcp mcp init command creates empty projects without templates
  • Some tool calls produce JSON function call syntax as text rather than executing (intermittent)
  • ModelConfig endpoint Helm value requires manual baseUrl patch post-install
  • Pod log retrieval requires the exact pod name (including hash suffix)
  • spec.declarative.memory.enabled field not recognized (schema mismatch in docs)

HAMi v2.8.1:

  • HAMi WebUI requires separate installation (not included in main Helm chart)
  • libvgpu.so injection requires compatible base images. Pure ubuntu:22.04 without CUDA causes container crashes
  • The conflicting /etc/containerd/conf.d/99-nvidia.toml (version 1 format) must be removed before installation on k3s
  • hami-scheduler Pending status after upgrade (old scheduler pod slow to terminate) is cosmetic

> Conclusion

The combination is not just technically sound. It's architecturally coherent. Everything is K8s-native, everything is declarative, everything is auditable. The same kubectl that manages your deployments manages your AI agents. The same K8s scheduler that places your pods places your GPU workloads, with HAMi providing the virtualization layer. Both kagent and HAMi are actively developed CNCF sandbox projects. The issues I encountered are known and being worked on. The future of AI infrastructure is K8s-native. kagent + HAMi + Nebius shows what that looks like.
