Conversation
**Walkthrough**

Adds a new how-to guide documenting KEDA-based autoscaling for vLLM inference services, covering prerequisites, RBAC for `kserve-controller-manager`, Prometheus `TriggerAuthentication`, `InferenceService` manifest changes (RawDeployment, external autoscalerClass, Prometheus metric), example PromQL, and verification steps.
**Sequence Diagram(s)**

```mermaid
sequenceDiagram
    participant Operator
    participant GitManifest as Git/Manifest
    participant KServe_Controller as KServe Controller
    participant KEDA
    participant Prometheus
    participant KubernetesAPI as K8s API
    Operator->>GitManifest: prepare InferenceService manifest (RawDeployment, autoscalerClass: external, Prometheus metric)
    Operator->>KubernetesAPI: apply TriggerAuthentication and secret
    Operator->>KubernetesAPI: apply InferenceService manifest
    KServe_Controller->>KubernetesAPI: create Deployment (stopped if HPA conflict)
    KEDA->>Prometheus: evaluate PromQL (external metric)
    Prometheus-->>KEDA: metric value
    KEDA->>KubernetesAPI: update HPA / scale Deployment
    KubernetesAPI-->>KServe_Controller: replica count change
    KServe_Controller-->>Operator: status / verification (ScaledObject, HPA, replicas)
```
**Estimated code review effort:** 🎯 2 (Simple) | ⏱️ ~10 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 2
🧹 Nitpick comments (3)
docs/en/model_inference/inference_service/how_to/keda_autoscaling.mdx (3)
**37-62: Consider narrowing RBAC permissions for better security.**

The ClusterRole grants wildcard access to all KEDA resources. For production environments, consider specifying only the required resources and verbs. KServe typically needs access to `scaledobjects` and `triggerauthentications` with the `get`, `list`, `watch`, `create`, `update`, `patch`, and `delete` verbs.

🔒 Suggested refinement for least-privilege RBAC

```diff
 apiVersion: rbac.authorization.k8s.io/v1
 kind: ClusterRole
 metadata:
   name: kserve-keda-manager-role
 rules:
 - apiGroups:
   - keda.sh
   resources:
-  - "*"
+  - scaledobjects
+  - triggerauthentications
   verbs:
-  - "*"
+  - get
+  - list
+  - watch
+  - create
+  - update
+  - patch
+  - delete
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/en/model_inference/inference_service/how_to/keda_autoscaling.mdx` around lines 37 - 62, The ClusterRole kserve-keda-manager-role currently uses wildcard resources/verbs; replace the "*" resource/verb entries with explicit KEDA resources and verbs needed by KServe (e.g., resources: scaledobjects, triggerauthentications and verbs: get, list, watch, create, update, patch, delete) so the role grants least privilege, and ensure the existing ClusterRoleBinding referencing the ServiceAccount kserve-controller-manager in namespace kserve remains intact to bind the narrowed role.
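Applied in full, the narrowed role described here might look like the following sketch. This is only an illustration, not the guide's actual manifest: the role and binding names are taken from the review's example, and the `kserve-controller-manager` ServiceAccount in the `kserve` namespace is the one the prompt assumes — verify both against your installation.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kserve-keda-manager-role
rules:
- apiGroups:
  - keda.sh
  # Only the KEDA resources KServe is expected to manage, per the review.
  resources:
  - scaledobjects
  - triggerauthentications
  verbs:
  - get
  - list
  - watch
  - create
  - update
  - patch
  - delete
---
# Binds the narrowed role to the KServe controller's ServiceAccount,
# as the existing ClusterRoleBinding in the guide is assumed to do.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kserve-keda-manager-rolebinding
subjects:
- kind: ServiceAccount
  name: kserve-controller-manager
  namespace: kserve
roleRef:
  kind: ClusterRole
  name: kserve-keda-manager-role
  apiGroup: rbac.authorization.k8s.io
```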
**99-105: Consider verifying the source secret exists before copying.**

The command assumes the secret `kube-prometheus-alertmanager-basic-auth` exists in the `cpaas-system` namespace. Adding a verification step would make the instructions more robust and provide clearer error messages if the secret is missing.

💡 Suggested verification step

Add this before the secret creation command:

```bash
# Verify the source secret exists
kubectl get secret kube-prometheus-alertmanager-basic-auth -n cpaas-system

# Then create the secret in your namespace
kubectl create secret generic prom-basic-auth-secret \
  --namespace=<your-namespace> \
  --from-literal=username=$(kubectl get secret kube-prometheus-alertmanager-basic-auth \
    -n cpaas-system -o jsonpath='{.data.username}' | base64 -d) \
  --from-literal=password=$(kubectl get secret kube-prometheus-alertmanager-basic-auth \
    -n cpaas-system -o jsonpath='{.data.password}' | base64 -d)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/en/model_inference/inference_service/how_to/keda_autoscaling.mdx` around lines 99 - 105, The docs assume the secret kube-prometheus-alertmanager-basic-auth exists in cpaas-system; add a verification step before the kubectl create secret command to check for that source secret (e.g., run kubectl get secret kube-prometheus-alertmanager-basic-auth -n cpaas-system and handle a non-zero exit or missing output) and only proceed to the create secret command that uses --from-literal=username=$(kubectl get secret ... -o jsonpath='{.data.username}' | base64 -d) and --from-literal=password=$(kubectl get secret ... -o jsonpath='{.data.password}' | base64 -d) if the source secret is present, documenting the error message to show when the source secret is missing.
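The `jsonpath` + `base64 -d` pattern used in the copy command can be illustrated without a cluster. This is only a sketch of the decoding step: Secret `data` fields are stored base64-encoded, and `admin` below is a hypothetical stand-in, not a credential from the guide.

```shell
# Kubernetes Secrets store values base64-encoded; `base64 -d` recovers the literal.
# 'admin' is a hypothetical stand-in value, not a real credential.
encoded=$(printf 'admin' | base64)
decoded=$(printf '%s' "$encoded" | base64 -d)
echo "$decoded"
```

The same round trip is what the suggested command performs against the live Secret, only with `kubectl get secret ... -o jsonpath` supplying the encoded value.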
**163-163: Consider explaining the target value choice.**

The target value of `'1'` means KEDA will scale to have one running request per replica. While the formula is explained in callout #5, adding context about why `'1'` is chosen (e.g., one concurrent request per replica; adjust based on GPU capacity) would help users understand if they should modify this value for their workload.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/en/model_inference/inference_service/how_to/keda_autoscaling.mdx` at line 163, Update the documentation near the KEDA scaler configuration where "value: '1'" is specified to explain why the target is set to '1' and when to change it: state that this target means one running request per replica (referencing the formula in callout `#5`), recommend increasing the value for GPUs or high-concurrency-capable replicas and decreasing for CPU-bound or memory-constrained workloads, and provide a brief guideline (e.g., choose based on expected concurrent requests per replica and hardware capacity) so readers can decide whether to adjust "value: '1'".
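The replica math referenced here, callout #5's `ceil(metricValue / value)`, can be sketched in shell with integer arithmetic. The numbers below are hypothetical and not taken from the guide:

```shell
# KEDA computes desired replicas as ceil(metricValue / target).
# Hypothetical example: 7 running requests, per-replica target of 2.
metric_value=7
target=2
desired=$(( (metric_value + target - 1) / target ))  # integer ceiling: ceil(7/2) = 4
echo "$desired"
```

With the guide's `value: '1'`, `desired` simply equals the metric value — one replica per running request — which is why raising the target is the lever for GPU-backed replicas that can serve several concurrent requests.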
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/en/model_inference/inference_service/how_to/keda_autoscaling.mdx`:
- Around line 166-172: Update Callout `#2` to clarify that "prom-basic-auth" in
the example is the TriggerAuthentication resource name and the actual Secret
resource is "prom-basic-auth-secret"; instruct readers to replace the Secret
name (prom-basic-auth-secret) with their secret and, if needed, the
TriggerAuthentication name (prom-basic-auth) with their TriggerAuthentication
resource—refer to the Callout `#2` text and the example Secret identifier
prom-basic-auth-secret shown earlier to make this distinction explicit.
- Around line 158-159: Update the PromQL query placeholder to match the
document's consistent naming: replace isvc_name="<your-model-name>" with
isvc_name="<your-isvc-name>" in the query string so it aligns with other
references; locate the query shown under the vllm:num_requests_running example
and change only the placeholder token to "<your-isvc-name>" to preserve
consistency.
---
Nitpick comments:
In `@docs/en/model_inference/inference_service/how_to/keda_autoscaling.mdx`:
- Around line 37-62: The ClusterRole kserve-keda-manager-role currently uses
wildcard resources/verbs; replace the "*" resource/verb entries with explicit
KEDA resources and verbs needed by KServe (e.g., resources: scaledobjects,
triggerauthentications and verbs: get, list, watch, create, update, patch,
delete) so the role grants least privilege, and ensure the existing
ClusterRoleBinding referencing the ServiceAccount kserve-controller-manager in
namespace kserve remains intact to bind the narrowed role.
- Around line 99-105: The docs assume the secret
kube-prometheus-alertmanager-basic-auth exists in cpaas-system; add a
verification step before the kubectl create secret command to check for that
source secret (e.g., run kubectl get secret
kube-prometheus-alertmanager-basic-auth -n cpaas-system and handle a non-zero
exit or missing output) and only proceed to the create secret command that uses
--from-literal=username=$(kubectl get secret ... -o jsonpath='{.data.username}'
| base64 -d) and --from-literal=password=$(kubectl get secret ... -o
jsonpath='{.data.password}' | base64 -d) if the source secret is present,
documenting the error message to show when the source secret is missing.
- Line 163: Update the documentation near the KEDA scaler configuration where
"value: '1'" is specified to explain why the target is set to '1' and when to
change it: state that this target means one running request per replica
(referencing the formula in callout `#5`), recommend increasing the value for GPUs
or high-concurrency-capable replicas and decreasing for CPU-bound or
memory-constrained workloads, and provide a brief guideline (e.g., choose based
on expected concurrent requests per replica and hardware capacity) so readers
can decide whether to adjust "value: '1'".
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 4f745cb0-800b-4b16-b4c3-d6cb7fbbc79b
📒 Files selected for processing (1)
docs/en/model_inference/inference_service/how_to/keda_autoscaling.mdx
```mdx
<Callouts>
1. Disables the built-in KServe HPA and delegates scaling to KEDA.
2. References the `Secret` that holds the credentials for authenticating with Prometheus. Replace `prom-basic-auth` with the name of your actual secret.
3. A PromQL query that returns the current load as a single numeric value. Replace `<your-model-name>` and `<your-namespace>` with your actual values.
4. The internal address of your Prometheus instance, e.g., `http://prometheus-operated.cpaas-system.svc.cluster.local:9090`.
5. The per-replica target value. KEDA computes `ceil(metricValue / value)` to determine the desired number of replicas.
</Callouts>
```
**Clarify the callout about authentication reference.**

Callout #2 (line 168) states "Replace `prom-basic-auth` with the name of your actual secret," but `prom-basic-auth` is the `TriggerAuthentication` resource name, not the Secret name. The Secret name is `prom-basic-auth-secret` (as shown in line 99). This could confuse readers.
📝 Suggested clarification

```diff
 <Callouts>
 1. Disables the built-in KServe HPA and delegates scaling to KEDA.
-2. References the `Secret` that holds the credentials for authenticating with Prometheus. Replace `prom-basic-auth` with the name of your actual secret.
+2. References the `TriggerAuthentication` resource that holds the credentials for authenticating with Prometheus. Replace `prom-basic-auth` with the name of your actual TriggerAuthentication.
 3. A PromQL query that returns the current load as a single numeric value. Replace `<your-model-name>` and `<your-namespace>` with your actual values.
 4. The internal address of your Prometheus instance, e.g., `http://prometheus-operated.cpaas-system.svc.cluster.local:9090`.
 5. The per-replica target value. KEDA computes `ceil(metricValue / value)` to determine the desired number of replicas.
 </Callouts>
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```mdx
<Callouts>
1. Disables the built-in KServe HPA and delegates scaling to KEDA.
2. References the `TriggerAuthentication` resource that holds the credentials for authenticating with Prometheus. Replace `prom-basic-auth` with the name of your actual TriggerAuthentication.
3. A PromQL query that returns the current load as a single numeric value. Replace `<your-model-name>` and `<your-namespace>` with your actual values.
4. The internal address of your Prometheus instance, e.g., `http://prometheus-operated.cpaas-system.svc.cluster.local:9090`.
5. The per-replica target value. KEDA computes `ceil(metricValue / value)` to determine the desired number of replicas.
</Callouts>
```
Deploying alauda-ai with:

| | |
| --- | --- |
| Latest commit: | b2e3323 |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://5fbe4941.alauda-ai.pages.dev |
| Branch Preview URL: | https://add-keda.alauda-ai.pages.dev |
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/en/model_inference/inference_service/how_to/keda_autoscaling.mdx`:
- Around line 77-84: Add a step to remove the stop annotation so the
InferenceService can resume after KEDA config: after the "Configure the
InferenceService for KEDA" section and before "Verify the Setup", instruct the
user to remove the serving.kserve.io/stop annotation (i.e., kubectl annotate
inferenceservice <your-isvc-name> -n <your-namespace> serving.kserve.io/stop-)
so the InferenceService can start and KEDA autoscaling will take effect;
reference the annotation key serving.kserve.io/stop and the sections "Configure
the InferenceService for KEDA" and "Verify the Setup" to show where to add this
step.
- Around line 19-21: The info box claims KEDA can "scale from zero to N
instances" but the guide configures an InferenceService in RawDeployment mode
which uses Kubernetes HPA and does not support scale-from-zero; update the text
in the info box to either remove the scale-from-zero claim or qualify it by
adding a line such as "KEDA can scale from zero to N when used with Knative
mode; this guide demonstrates metric-based autoscaling with RawDeployment, which
does not support scaling from zero (RawDeployment uses HPA and lacks Knative's
activator)." Ensure you reference RawDeployment and InferenceService so readers
know the limitation applies to this configuration and maintain the existing KEDA
vs Knative distinction.
- Around line 35-36: The RBAC example currently gives kserve-controller-manager
blanket KEDA access via resources: ["*"] and verbs: ["*"]; replace that wildcard
rule in the Role/ClusterRole snippet that references keda.sh with a scoped rule
that lists only the required apiGroup keda.sh and resources scaledobjects,
scaledobjects/finalizers, scaledobjects/status and the verbs create, delete,
get, list, watch, patch, update so the Role assigned to
kserve-controller-manager follows least-privilege; update the YAML example where
kserve-controller-manager and KEDA permissions are declared to use this explicit
list instead of wildcards.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 93f8a5b5-66f6-422a-84c7-07862caca9bc
📒 Files selected for processing (1)
docs/en/model_inference/inference_service/how_to/keda_autoscaling.mdx
```mdx
:::info
KEDA extends the standard Kubernetes Horizontal Pod Autoscaler (HPA), allowing applications to scale from zero to N instances and back down based on a wide variety of event sources—including Prometheus metrics. It introduces an open and extensible framework so that KServe can scale on virtually any signal relevant to your AI model's performance.
:::
```
🧩 Analysis chain
🏁 Script executed:

```shell
cat -n docs/en/model_inference/inference_service/how_to/keda_autoscaling.mdx | head -100
```

Repository: alauda/aml-docs
Length of output: 4712
🌐 Web query:
KServe scale from zero HTTP requests Standard mode
💡 Result:
In KServe “Standard” deployment mode (a.k.a. RawDeployment in some UIs/docs), scale-from-zero based on incoming HTTP requests is not supported. Standard mode runs model servers as regular Kubernetes Deployments behind Gateway/Ingress, so there’s no request-buffering “activator” component to hold an HTTP request while the Deployment scales up from 0 replicas. KServe’s docs explicitly note that request-based autoscaling with scale-to-zero isn’t supported in Standard mode. [1], [2]
If you need HTTP scale-to-zero / scale-from-zero, use KServe + Knative (Serverless/Knative mode), which provides the request-driven autoscaling and scale-to-zero behavior. [2], [3]
Sources:
[1] KServe Kubernetes Deployment Installation Guide (Standard mode note about scale-from-zero for HTTP)
[2] KServe Control Plane / Standard vs Knative mode (Knative provides scale-to-zero)
[3] KServe GitHub README (standard install does not support request-based autoscaling with scale-to-zero)
🌐 Web query:
KServe InferenceService RawDeployment scale-from-zero
💡 Result:
In KServe RawDeployment mode, an InferenceService cannot scale down to 0 or scale up from 0. RawDeployment uses Kubernetes HPA (rather than Knative’s autoscaling), and “scale down and from zero is not supported” in this mode. [1]
If you need scale-from-zero, use Knative/Serverless (KServe “Standard”) autoscaling (KPA) and set minReplicas: 0. KServe docs note you can enable scale-to-zero by setting minReplicas to 0. [2]
Example (KPA / scale-to-zero):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: mymodel
spec:
  predictor:
    minReplicas: 0
    maxReplicas: 5
    # ... model/runtime config ...
```

Workarounds in RawDeployment: you can still autoscale with HPA (or external autoscalers), but not request-driven scale-from-zero like Knative provides. [1]
Sources:
[1] KServe docs — RawDeployment autoscaling (HPA) and “scale down and from zero is not supported” https://kserve.github.io/archive/0.15/modelserving/autoscaling/raw_deployment_autoscaling/
[2] KServe docs — KPA autoscaling; minReplicas: 0 enables scale-to-zero https://kserve.github.io/website/docs/model-serving/predictive-inference/autoscaling/kpa-autoscaler
🏁 Script executed:

```shell
cat -n docs/en/model_inference/inference_service/how_to/keda_autoscaling.mdx | tail -200
```

Repository: alauda/aml-docs
Length of output: 9080
Qualify the scale-from-zero claim—it is not supported in RawDeployment mode.
The info box (lines 19–21) states that KEDA allows applications to "scale from zero to N instances," but this guide configures an InferenceService in RawDeployment mode (line 28), which does not support scaling down to zero or up from zero. RawDeployment uses Kubernetes HPA, not Knative's autoscaling (KPA), and lacks the request-buffering "activator" component needed for scale-from-zero. The guide achieves metric-based autoscaling only, not scale-from-zero capability.
Either remove the scale-from-zero promise from the info box, or add a clear qualification: "KEDA can scale from zero to N instances when used with Knative mode; this guide demonstrates metric-based autoscaling with RawDeployment, which does not support scaling from zero."
> Before proceeding, apply the following RBAC resources to allow `kserve-controller-manager` to manage KEDA objects (`ScaledObject`, `TriggerAuthentication`, etc.):
🧩 Analysis chain
🏁 Script executed:

```shell
cat -n docs/en/model_inference/inference_service/how_to/keda_autoscaling.mdx | head -60
```

Repository: alauda/aml-docs
Length of output: 3115
🏁 Script executed:

```shell
# Search for KEDA resource references in the codebase
rg -i "scaledObject|triggerAuthentication|trigger" --type yaml --type yml
```

Repository: alauda/aml-docs
Length of output: 86
🌐 Web query:
KServe KEDA integration permissions required resources
💡 Result:
KServe ↔ KEDA integration: what permissions/resources are required?
1) KServe controller RBAC (what it must be allowed to manage for KEDA)
In upstream KServe, the controller ClusterRole (kserve-manager-role) includes these KEDA-specific permissions (needed so KServe can create/manage the KEDA objects it generates when you set serving.kserve.io/autoscalerClass: "keda"):
- API group: `keda.sh`
- Resources & verbs:
  - `scaledobjects`, `scaledobjects/finalizers`: create, delete, get, list, watch, patch, update
  - `scaledobjects/status`: get, patch, update [1]
(That same role also grants KServe permissions it needs around the scale target and related plumbing, e.g. deployments.apps and horizontalpodautoscalers.autoscaling, but the items above are the KEDA integration–specific pieces.) [1]
2) KEDA resources that must exist in the cluster
KServe’s KEDA autoscaling requires KEDA installed (i.e., KEDA CRDs/controllers available). KServe then creates a ScaledObject per predictor/deployment when configured. [2]
3) If you scale from Prometheus/custom metrics: extra namespaced resources
When using Prometheus (or other external metric backends), you typically also create in the same namespace as the InferenceService:
- `Secret` (credentials/CA/bearer token, if needed)
- `TriggerAuthentication` referencing that Secret [2]
Also note: TriggerAuthentication is namespaced and (by default) can only be used by a ScaledObject in the same namespace. [3]
Sources:
[1] KServe controller RBAC (kserve-manager-role) showing KEDA ScaledObject permissions (raw.githubusercontent.com)
[2] KServe “Autoscaling with KEDA” docs (ScaledObject creation + TriggerAuthentication/Secret example) (kserve.github.io)
[3] KEDA Authentication docs (TriggerAuthentication namespace scoping) (keda.sh)
Citations:
- 1: https://raw.githubusercontent.com/kserve/kserve/master/config/rbac/role.yaml
- 2: https://kserve.github.io/website/docs/model-serving/predictive-inference/autoscaling/keda-autoscaler
- 3: https://keda.sh/docs/2.19/concepts/authentication/?utm_source=openai
Scope KEDA permissions to minimal required resources and verbs.
Lines 45-48 use resources: ["*"] and verbs: ["*"], granting kserve-controller-manager blanket access to all KEDA resources and actions. Per upstream KServe RBAC configuration, the controller only needs the following scoped permissions:
```yaml
- apiGroups:
  - keda.sh
  resources:
  - scaledobjects
  - scaledobjects/finalizers
  - scaledobjects/status
  verbs:
  - create
  - delete
  - get
  - list
  - watch
  - patch
  - update
```

This applies to lines 42–48. Update the example to follow least privilege by default.
> ### Stop the Running InferenceService
>
> Before making changes, stop the running `InferenceService` to avoid conflicts between the existing HPA and the new KEDA-managed scaler. Add the following annotation to stop it:
>
> ```bash
> kubectl annotate inferenceservice <your-isvc-name> -n <your-namespace> \
>   serving.kserve.io/stop='true'
> ```
🧩 Analysis chain
🏁 Script executed:

```shell
cat -n docs/en/model_inference/inference_service/how_to/keda_autoscaling.mdx | head -100
```

Repository: alauda/aml-docs
Length of output: 4712
🏁 Script executed:

```shell
cat -n docs/en/model_inference/inference_service/how_to/keda_autoscaling.mdx | sed -n '70,220p'
```

Repository: alauda/aml-docs
Length of output: 6772
🏁 Script executed:

```shell
grep -n "serving.kserve.io/stop" docs/en/model_inference/inference_service/how_to/keda_autoscaling.mdx
```

Repository: alauda/aml-docs
Length of output: 93
Add an explicit step to remove the stop annotation before the service can start.
The guide stops the InferenceService with serving.kserve.io/stop='true' but never removes this annotation. Without clearing it after applying the updated manifest, the service remains stopped and KEDA autoscaling cannot function. Add the following after the "Configure the InferenceService for KEDA" section and before "Verify the Setup":
Suggested addition

````diff
 The HPA output will show the current metric value, the scaling threshold, and the current/desired replica counts. As inference traffic increases, the `TARGETS` value will rise and replicas will scale up automatically.

+### Remove the Stop Annotation
+
+After applying the updated `InferenceService` manifest with KEDA configuration, remove the temporary stop annotation so the service can start and be managed by KEDA:
+
+```bash
+kubectl annotate inferenceservice <your-isvc-name> -n <your-namespace> serving.kserve.io/stop-
+```
+
 </Steps>
````

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/en/model_inference/inference_service/how_to/keda_autoscaling.mdx` around
lines 77 - 84, Add a step to remove the stop annotation so the InferenceService
can resume after KEDA config: after the "Configure the InferenceService for
KEDA" section and before "Verify the Setup", instruct the user to remove the
serving.kserve.io/stop annotation (i.e., kubectl annotate inferenceservice
<your-isvc-name> -n <your-namespace> serving.kserve.io/stop-) so the
InferenceService can start and KEDA autoscaling will take effect; reference the
annotation key serving.kserve.io/stop and the sections "Configure the
InferenceService for KEDA" and "Verify the Setup" to show where to add this
step.