Commit a6a4f36

sozercan and athreesh authored
docs: update aks guide (#3651)
Signed-off-by: Sertac Ozercan <[email protected]> Co-authored-by: Anish <[email protected]>
1 parent 135bce4 commit a6a4f36

File tree: 1 file changed, 45 additions and 167 deletions
# Dynamo on AKS

This guide covers deploying Dynamo and running LLM inference on Azure Kubernetes Service (AKS). You'll learn how to set up an AKS cluster with GPU nodes, install the required components, and deploy your first model.

## Prerequisites

Before you begin, ensure you have:

- An active Azure subscription
- Sufficient Azure quota for GPU VMs
- [kubectl](https://kubernetes.io/docs/tasks/tools/) installed
- [Helm](https://helm.sh/docs/intro/install/) installed
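Assuming you are working from Azure Cloud Shell or a local terminal, the prerequisites can be sanity-checked with a few commands; the `aks-preview` extension lines are optional and carried over from an earlier revision of this guide:

```shell
# Log in and confirm the subscription in use
az login
az account show --output table

# Optional: the aks-preview CLI extension for newer AKS features
az extension add --name aks-preview
az extension update --name aks-preview

# Confirm the client tooling is installed
kubectl version --client
helm version
```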

## Step 1: Create AKS Cluster with GPU Nodes

If you don't have an AKS cluster yet, create one using the [Azure CLI](https://learn.microsoft.com/en-us/azure/aks/learn/quick-kubernetes-deploy-cli), [Azure PowerShell](https://learn.microsoft.com/en-us/azure/aks/learn/quick-kubernetes-deploy-powershell), or the [Azure portal](https://learn.microsoft.com/en-us/azure/aks/learn/quick-kubernetes-deploy-portal).

Ensure your AKS cluster has a node pool with GPU-enabled nodes. Follow the [Use GPUs for compute-intensive workloads on Azure Kubernetes Service (AKS)](https://learn.microsoft.com/en-us/azure/aks/use-nvidia-gpu?tabs=add-ubuntu-gpu-node-pool#skip-gpu-driver-installation) guide to create a GPU-enabled node pool.

**Important:** It is recommended to **skip the GPU driver installation** during node pool creation, as the NVIDIA GPU Operator will handle this in the next step.
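For reference, an earlier revision of this guide created the cluster and node pool directly with the Azure CLI. The angle-bracket values are placeholders, and the `standard_nc24ads_a100_v4` SKU (one A100 per node) and node counts are only one possible choice:

```shell
export REGION=<region>
export RESOURCE_GROUP=<rg_name>
export CLUSTER_NAME=<aks_cluster_name>

# Create the cluster with a small CPU system pool
az aks create -g $RESOURCE_GROUP -n $CLUSTER_NAME --location $REGION --node-count 1

# Fetch credentials and verify the kubectl context
az aks get-credentials --resource-group $RESOURCE_GROUP --name $CLUSTER_NAME
kubectl config get-contexts

# Add a GPU node pool; skip the driver install so the GPU Operator can manage drivers
az aks nodepool add --resource-group $RESOURCE_GROUP --cluster-name $CLUSTER_NAME \
  --name gpupool --node-count 4 --node-vm-size standard_nc24ads_a100_v4 \
  --skip-gpu-driver-install
```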

## Step 2: Install NVIDIA GPU Operator

Once your AKS cluster is configured with a GPU-enabled node pool, install the NVIDIA GPU Operator. This operator automates the deployment and lifecycle of all NVIDIA software components required to provision GPUs in the Kubernetes cluster, including drivers, the container toolkit, the device plugin, and monitoring tools.

Follow the [Installing the NVIDIA GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html) guide to install the GPU Operator on your AKS cluster.
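Concretely, the Operator install reduces to two Helm commands; these come from an earlier revision of this guide and should match the linked NVIDIA documentation:

```shell
# Add the NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update

# Install the GPU Operator into its own namespace (validation takes ~5 minutes)
helm install --create-namespace --namespace gpu-operator nvidia/gpu-operator --wait --generate-name

# Watch the operator pods come up
kubectl get pods -n gpu-operator
```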

You should see output similar to the example below. Note that this is not the complete output; there should be additional pods running. The most important thing is to verify that the GPU Operator pods are in a `Running` state.

```bash
NAMESPACE      NAME                                        READY   STATUS      RESTARTS   AGE
gpu-operator   gpu-feature-discovery-xxxxx                 1/1     Running     0          2m
gpu-operator   gpu-operator-xxxxx                          1/1     Running     0          2m
gpu-operator   nvidia-container-toolkit-daemonset-xxxxx    1/1     Running     0          2m
gpu-operator   nvidia-cuda-validator-xxxxx                 0/1     Completed   0          1m
gpu-operator   nvidia-device-plugin-daemonset-xxxxx        1/1     Running     0          2m
gpu-operator   nvidia-driver-daemonset-xxxxx               1/1     Running     0          2m
```
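Once the driver and device-plugin pods are running, it is worth confirming that the GPUs are actually schedulable. The `custom-columns` query below is a generic Kubernetes pattern rather than anything Dynamo-specific, and `<pod-name>`/`<namespace>` are placeholders:

```shell
# Each GPU node should advertise allocatable nvidia.com/gpu resources
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'

# Or run nvidia-smi inside any GPU pod
kubectl exec -it <pod-name> -n <namespace> -- nvidia-smi
```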

## Step 3: Deploy Dynamo Kubernetes Operator

Follow the [Deploying Inference Graphs to Kubernetes](../../../docs/kubernetes/README.md) guide to install Dynamo on your AKS cluster.
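For reference, an earlier revision of this guide installed Dynamo from the Helm charts published on NGC roughly as follows; `RELEASE_VERSION=0.3.2` and the `dynamo-cloud` namespace come from that revision, so check the linked guide for current values:

```shell
export NAMESPACE=dynamo-cloud
export RELEASE_VERSION=0.3.2

# Fetch the CRDs and platform helm charts
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${RELEASE_VERSION}.tgz
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz

# Install the Custom Resource Definitions (CRDs)
helm install dynamo-crds dynamo-crds-${RELEASE_VERSION}.tgz --namespace default --wait --atomic

# Install the Dynamo platform
kubectl create namespace ${NAMESPACE}
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE}
```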

Validate that the Dynamo pods are running:

```bash
kubectl get pods -n dynamo-system

# Expected output:
# NAME                                                              READY   STATUS    RESTARTS   AGE
# dynamo-platform-dynamo-operator-controller-manager-xxxxxxxxxx     2/2     Running   0          2m50s
# dynamo-platform-etcd-0                                            1/1     Running   0          2m50s
# dynamo-platform-nats-0                                            2/2     Running   0          2m50s
# dynamo-platform-nats-box-xxxxxxxxxx                               1/1     Running   0          2m51s
```

## Step 4: Deploy and Test a Model

Follow the [Deploy Model/Workflow](../../../docs/kubernetes/installation_guide.md#next-steps) guide to deploy and test a model on your AKS cluster.
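As a concrete end-to-end example, an earlier revision of this guide deployed `microsoft/Phi-3.5-vision-instruct` from the repository's multimodal example and exercised it with an OpenAI-style chat-completions request. Manifest paths and the namespace may have changed since, so treat this as a sketch:

```shell
export NAMESPACE=dynamo-cloud   # use the namespace Dynamo was installed into

# Store your Hugging Face token as a Kubernetes secret
export HF_TOKEN=your_hf_token
kubectl create secret generic hf-token-secret --from-literal=HF_TOKEN=${HF_TOKEN} -n ${NAMESPACE}

# Deploy the example (phi3v took ~5 minutes in that revision)
kubectl apply -f examples/multimodal/deploy/k8s/agg-phi3v.yaml -n ${NAMESPACE}
kubectl get dynamoGraphDeployment -n ${NAMESPACE}

# Port-forward the frontend service and send a test request
SERVICE_NAME=$(kubectl get svc -n ${NAMESPACE} -o name | grep frontend | sed 's|.*/||' | sed 's|-frontend||' | head -n1)
kubectl port-forward svc/${SERVICE_NAME}-frontend 8000:8000 -n ${NAMESPACE} &

curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "microsoft/Phi-3.5-vision-instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          { "type": "text", "text": "What is in this image?" },
          { "type": "image_url", "image_url": { "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" } }
        ]
      }
    ],
    "stream": false
  }'
```

The response should be a `chat.completion` object whose message content describes the boardwalk image.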

## Clean Up Resources

If you want to clean up the Dynamo resources created during this guide, you can run the following commands:

```bash
# Delete all Dynamo Graph Deployments
kubectl delete dynamographdeployments.nvidia.com --all --all-namespaces

# Uninstall Dynamo Platform and CRDs
helm uninstall dynamo-platform -n dynamo-kubernetes
helm uninstall dynamo-crds -n default
```

This will spin down the Dynamo deployment and all associated resources.

If you want to delete the GPU Operator, follow the instructions in the [Uninstalling the NVIDIA GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/uninstall.html) guide.

If you want to delete the entire AKS cluster, follow the instructions in the [Delete an AKS cluster](https://learn.microsoft.com/en-us/azure/aks/delete-cluster) guide.
