
Add howto integrate KEDA #133

Open
fyuan1316 wants to merge 2 commits into `master` from `add-keda`

Conversation

@fyuan1316
Contributor

@fyuan1316 fyuan1316 commented Mar 9, 2026

Summary by CodeRabbit

  • Documentation
    • Added a step-by-step guide for enabling KEDA-based autoscaling of inference services: prerequisites, RBAC setup, Prometheus TriggerAuthentication and metric configuration, deployment settings to enable external autoscaling, recommended vLLM metrics and example PromQL, verification steps, and operational tips for naming, queries, and monitoring replica behavior.

@coderabbitai

coderabbitai bot commented Mar 9, 2026

Walkthrough

Adds a new how-to guide documenting KEDA-based autoscaling for vLLM inference services, covering prerequisites, RBAC for kserve-controller-manager, Prometheus TriggerAuthentication, InferenceService manifest changes (RawDeployment, external autoscalerClass, Prometheus metric), example PromQL, and verification steps.
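For orientation, the manifest changes the walkthrough describes might look roughly like the following sketch (the service name and replica bounds are hypothetical; the annotation keys follow KServe's `serving.kserve.io` conventions and should be verified against your KServe version):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-vllm-model                                # hypothetical name
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment  # plain Deployment, no Knative
    serving.kserve.io/autoscalerClass: external      # hand scaling off to KEDA
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 5
    # ... existing model/runtime configuration, plus the Prometheus
    # metric and per-replica target described in the guide ...
```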

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **KEDA Autoscaling Documentation**<br/>`docs/en/model_inference/inference_service/how_to/keda_autoscaling.mdx` | New guide describing end-to-end setup: prerequisites (KServe, KEDA, Prometheus), RBAC to allow kserve-controller-manager access to KEDA, creating TriggerAuthentication and secrets, updating InferenceService to use RawDeployment + external autoscaler with Prometheus metric and per-replica target, example vLLM metrics/PromQL, and verification steps (ScaledObject, HPA, replica checks). |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Operator
    participant Manifest as Git/Manifest
    participant KServe_Controller as KServe Controller
    participant KEDA
    participant Prometheus
    participant KubernetesAPI as K8s API

    Operator->>Manifest: prepare InferenceService manifest (RawDeployment, autoscalerClass: external, Prometheus metric)
    Operator->>KubernetesAPI: apply TriggerAuthentication and secret
    Operator->>KubernetesAPI: apply InferenceService manifest
    KServe_Controller->>KubernetesAPI: create Deployment (stopped if HPA conflict)
    KEDA->>Prometheus: evaluate PromQL (external metric)
    Prometheus-->>KEDA: metric value
    KEDA->>KubernetesAPI: update HPA / scale Deployment
    KubernetesAPI-->>KServe_Controller: replica count change
    KServe_Controller-->>Operator: status / verification (ScaledObject, HPA, replicas)
```

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐇 I hopped through docs, a helpful little fling,
Wove KEDA, PromQL, and vLLM into one spring.
Secrets tucked, TriggerAuth set with care,
Autoscale hums — replicas dance in the air. 🥕📈

🚥 Pre-merge checks | ✅ 3 passed

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Add howto integrate KEDA' is partially related to the changeset. It references KEDA integration, the main topic, but uses awkward phrasing ('howto' instead of 'how to') and lacks specificity about autoscaling. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (3)
docs/en/model_inference/inference_service/how_to/keda_autoscaling.mdx (3)

37-62: Consider narrowing RBAC permissions for better security.

The ClusterRole grants wildcard access to all KEDA resources. For production environments, consider specifying only the required resources and verbs. KServe typically needs access to scaledobjects and triggerauthentications with get, list, watch, create, update, patch, and delete verbs.

🔒 Suggested refinement for least-privilege RBAC

```diff
 apiVersion: rbac.authorization.k8s.io/v1
 kind: ClusterRole
 metadata:
   name: kserve-keda-manager-role
 rules:
 - apiGroups:
   - keda.sh
   resources:
-  - "*"
+  - scaledobjects
+  - triggerauthentications
   verbs:
-  - "*"
+  - get
+  - list
+  - watch
+  - create
+  - update
+  - patch
+  - delete
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/en/model_inference/inference_service/how_to/keda_autoscaling.mdx` around
lines 37 - 62, The ClusterRole kserve-keda-manager-role currently uses wildcard
resources/verbs; replace the "*" resource/verb entries with explicit KEDA
resources and verbs needed by KServe (e.g., resources: scaledobjects,
triggerauthentications and verbs: get, list, watch, create, update, patch,
delete) so the role grants least privilege, and ensure the existing
ClusterRoleBinding referencing the ServiceAccount kserve-controller-manager in
namespace kserve remains intact to bind the narrowed role.
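Applied to the snippet above, the diff yields a ClusterRole like the following sketch (resource and verb list as suggested by the reviewer; verify the exact set against your KServe version):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kserve-keda-manager-role
rules:
- apiGroups:
  - keda.sh
  resources:
  - scaledobjects
  - triggerauthentications
  verbs:
  - get
  - list
  - watch
  - create
  - update
  - patch
  - delete
```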

99-105: Consider verifying the source secret exists before copying.

The command assumes the secret kube-prometheus-alertmanager-basic-auth exists in the cpaas-system namespace. Adding a verification step would make the instructions more robust and provide clearer error messages if the secret is missing.

💡 Suggested verification step

Add this before the secret creation command:

```bash
# Verify the source secret exists
kubectl get secret kube-prometheus-alertmanager-basic-auth -n cpaas-system

# Then create the secret in your namespace
kubectl create secret generic prom-basic-auth-secret \
  --namespace=<your-namespace> \
  --from-literal=username=$(kubectl get secret kube-prometheus-alertmanager-basic-auth \
    -n cpaas-system -o jsonpath='{.data.username}' | base64 -d) \
  --from-literal=password=$(kubectl get secret kube-prometheus-alertmanager-basic-auth \
    -n cpaas-system -o jsonpath='{.data.password}' | base64 -d)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/en/model_inference/inference_service/how_to/keda_autoscaling.mdx` around
lines 99 - 105, The docs assume the secret
kube-prometheus-alertmanager-basic-auth exists in cpaas-system; add a
verification step before the kubectl create secret command to check for that
source secret (e.g., run kubectl get secret
kube-prometheus-alertmanager-basic-auth -n cpaas-system and handle a non-zero
exit or missing output) and only proceed to the create secret command that uses
--from-literal=username=$(kubectl get secret ... -o jsonpath='{.data.username}'
| base64 -d) and --from-literal=password=$(kubectl get secret ... -o
jsonpath='{.data.password}' | base64 -d) if the source secret is present,
documenting the error message to show when the source secret is missing.
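The decode step in those commands relies on Kubernetes storing `Secret` data base64-encoded. A standalone sketch of the round trip, with a hypothetical credential standing in for the real Secret:

```shell
# Hypothetical plaintext standing in for the Secret's username field.
plain_user="prom-user"

# Kubernetes stores Secret data base64-encoded; this is the form that
# jsonpath '{.data.username}' returns:
encoded=$(printf '%s' "$plain_user" | base64)

# Piping through `base64 -d` recovers the original value, which is what
# the --from-literal flags in the suggested command consume:
decoded=$(printf '%s' "$encoded" | base64 -d)
echo "$decoded"
```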

163-163: Consider explaining the target value choice.

The target value of '1' means KEDA will scale to have one running request per replica. While the formula is explained in callout #5, adding context about why '1' is chosen (e.g., one concurrent request per replica, adjust based on GPU capacity) would help users understand if they should modify this value for their workload.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/en/model_inference/inference_service/how_to/keda_autoscaling.mdx` at
line 163, Update the documentation near the KEDA scaler configuration where
"value: '1'" is specified to explain why the target is set to '1' and when to
change it: state that this target means one running request per replica
(referencing the formula in callout `#5`), recommend increasing the value for GPUs
or high-concurrency-capable replicas and decreasing for CPU-bound or
memory-constrained workloads, and provide a brief guideline (e.g., choose based
on expected concurrent requests per replica and hardware capacity) so readers
can decide whether to adjust "value: '1'".
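The sizing rule behind that choice can be checked directly. A sketch of KEDA's replica computation, `ceil(metricValue / value)`, using hypothetical numbers:

```shell
# Hypothetical load: 7 requests currently running across all replicas.
metric=7
# Hypothetical per-replica target (the guide's example uses 1).
target=2
# Integer ceiling division implements ceil(metric / target).
desired=$(( (metric + target - 1) / target ))
echo "$desired"   # 7 requests at 2 per replica -> 4 replicas
```

With the guide's target of `1`, the same 7 running requests would instead yield 7 replicas, which is why the target should reflect how many concurrent requests one replica can actually serve.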
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/en/model_inference/inference_service/how_to/keda_autoscaling.mdx`:
- Around line 166-172: Update Callout `#2` to clarify that "prom-basic-auth" in
the example is the TriggerAuthentication resource name and the actual Secret
resource is "prom-basic-auth-secret"; instruct readers to replace the Secret
name (prom-basic-auth-secret) with their secret and, if needed, the
TriggerAuthentication name (prom-basic-auth) with their TriggerAuthentication
resource—refer to the Callout `#2` text and the example Secret identifier
prom-basic-auth-secret shown earlier to make this distinction explicit.
- Around line 158-159: Update the PromQL query placeholder to match the
document's consistent naming: replace isvc_name="<your-model-name>" with
isvc_name="<your-isvc-name>" in the query string so it aligns with other
references; locate the query shown under the vllm:num_requests_running example
and change only the placeholder token to "<your-isvc-name>" to preserve
consistency.
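With the placeholder aligned, the guide's example query would read something like the following (the `sum()` aggregation and `namespace` label are assumptions inferred from the review excerpts; only the metric name and `isvc_name` label appear verbatim in the guide):

```promql
sum(vllm:num_requests_running{isvc_name="<your-isvc-name>", namespace="<your-namespace>"})
```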


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 4f745cb0-800b-4b16-b4c3-d6cb7fbbc79b

📥 Commits

Reviewing files that changed from the base of the PR and between b2c4ac1 and 0d019e2.

📒 Files selected for processing (1)
  • docs/en/model_inference/inference_service/how_to/keda_autoscaling.mdx

Comment on lines +166 to +172
<Callouts>
1. Disables the built-in KServe HPA and delegates scaling to KEDA.
2. References the `Secret` that holds the credentials for authenticating with Prometheus. Replace `prom-basic-auth` with the name of your actual secret.
3. A PromQL query that returns the current load as a single numeric value. Replace `<your-model-name>` and `<your-namespace>` with your actual values.
4. The internal address of your Prometheus instance, e.g., `http://prometheus-operated.cpaas-system.svc.cluster.local:9090`.
5. The per-replica target value. KEDA computes `ceil(metricValue / value)` to determine the desired number of replicas.
</Callouts>

⚠️ Potential issue | 🟡 Minor

Clarify the callout about authentication reference.

Callout #2 (line 168) states "Replace prom-basic-auth with the name of your actual secret," but prom-basic-auth is the TriggerAuthentication resource name, not the Secret name. The Secret name is prom-basic-auth-secret (as shown in line 99). This could confuse readers.

📝 Suggested clarification

```diff
 <Callouts>
 1. Disables the built-in KServe HPA and delegates scaling to KEDA.
-2. References the `Secret` that holds the credentials for authenticating with Prometheus. Replace `prom-basic-auth` with the name of your actual secret.
+2. References the `TriggerAuthentication` resource that holds the credentials for authenticating with Prometheus. Replace `prom-basic-auth` with the name of your actual TriggerAuthentication.
 3. A PromQL query that returns the current load as a single numeric value. Replace `<your-model-name>` and `<your-namespace>` with your actual values.
 4. The internal address of your Prometheus instance, e.g., `http://prometheus-operated.cpaas-system.svc.cluster.local:9090`.
 5. The per-replica target value. KEDA computes `ceil(metricValue / value)` to determine the desired number of replicas.
 </Callouts>
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/en/model_inference/inference_service/how_to/keda_autoscaling.mdx` around
lines 166 - 172, Update Callout `#2` to clarify that "prom-basic-auth" in the
example is the TriggerAuthentication resource name and the actual Secret
resource is "prom-basic-auth-secret"; instruct readers to replace the Secret
name (prom-basic-auth-secret) with their secret and, if needed, the
TriggerAuthentication name (prom-basic-auth) with their TriggerAuthentication
resource—refer to the Callout `#2` text and the example Secret identifier
prom-basic-auth-secret shown earlier to make this distinction explicit.
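Tying those callouts together, the ScaledObject that KServe generates from this configuration might look roughly like the following sketch (resource names are hypothetical; the prometheus trigger's `serverAddress`/`query`/`threshold`/`authenticationRef` fields are standard KEDA scaler parameters):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-vllm-model-predictor        # hypothetical, derived from the predictor
spec:
  scaleTargetRef:
    name: my-vllm-model-predictor      # the predictor Deployment
  minReplicaCount: 1
  maxReplicaCount: 5
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus-operated.cpaas-system.svc.cluster.local:9090
      query: sum(vllm:num_requests_running{isvc_name="<your-isvc-name>", namespace="<your-namespace>"})
      threshold: "1"                   # the per-replica target from callout 5
    authenticationRef:
      name: prom-basic-auth            # the TriggerAuthentication, per callout 2
```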

@cloudflare-workers-and-pages

cloudflare-workers-and-pages bot commented Mar 9, 2026

Deploying alauda-ai with Cloudflare Pages

Latest commit: b2e3323
Status: ✅  Deploy successful!
Preview URL: https://5fbe4941.alauda-ai.pages.dev
Branch Preview URL: https://add-keda.alauda-ai.pages.dev

View logs

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 93f8a5b5-66f6-422a-84c7-07862caca9bc

📥 Commits

Reviewing files that changed from the base of the PR and between 0d019e2 and b2e3323.

📒 Files selected for processing (1)
  • docs/en/model_inference/inference_service/how_to/keda_autoscaling.mdx

Comment on lines +19 to +21
:::info
KEDA extends the standard Kubernetes Horizontal Pod Autoscaler (HPA), allowing applications to scale from zero to N instances and back down based on a wide variety of event sources—including Prometheus metrics. It introduces an open and extensible framework so that KServe can scale on virtually any signal relevant to your AI model's performance.
:::

⚠️ Potential issue | 🟠 Major



🌐 Web query:

KServe scale from zero HTTP requests Standard mode

💡 Result:

In KServe “Standard” deployment mode (a.k.a. RawDeployment in some UIs/docs), scale-from-zero based on incoming HTTP requests is not supported. Standard mode runs model servers as regular Kubernetes Deployments behind Gateway/Ingress, so there’s no request-buffering “activator” component to hold an HTTP request while the Deployment scales up from 0 replicas. KServe’s docs explicitly note that request-based autoscaling with scale-to-zero isn’t supported in Standard mode. [1], [2]

If you need HTTP scale-to-zero / scale-from-zero, use KServe + Knative (Serverless/Knative mode), which provides the request-driven autoscaling and scale-to-zero behavior. [2], [3]

Sources:
[1] KServe Kubernetes Deployment Installation Guide (Standard mode note about scale-from-zero for HTTP)
[2] KServe Control Plane / Standard vs Knative mode (Knative provides scale-to-zero)
[3] KServe GitHub README (standard install does not support request-based autoscaling with scale-to-zero)


🌐 Web query:

KServe InferenceService RawDeployment scale-from-zero

💡 Result:

In KServe RawDeployment mode, an InferenceService cannot scale down to 0 or scale up from 0. RawDeployment uses Kubernetes HPA (rather than Knative’s autoscaling), and “scale down and from zero is not supported” in this mode. [1]

If you need scale-from-zero, use Knative/Serverless (KServe “Standard”) autoscaling (KPA) and set minReplicas: 0. KServe docs note you can enable scale-to-zero by setting minReplicas to 0. [2]

Example (KPA / scale-to-zero):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: mymodel
spec:
  predictor:
    minReplicas: 0
    maxReplicas: 5
    # ... model/runtime config ...
```

Workarounds in RawDeployment: you can still autoscale with HPA (or external autoscalers), but not request-driven scale-from-zero like Knative provides. [1]

Sources:
[1] KServe docs — RawDeployment autoscaling (HPA) and “scale down and from zero is not supported” https://kserve.github.io/archive/0.15/modelserving/autoscaling/raw_deployment_autoscaling/
[2] KServe docs — KPA autoscaling; minReplicas: 0 enables scale-to-zero https://kserve.github.io/website/docs/model-serving/predictive-inference/autoscaling/kpa-autoscaler




Qualify the scale-from-zero claim—it is not supported in RawDeployment mode.

The info box (lines 19–21) states that KEDA allows applications to "scale from zero to N instances," but this guide configures an InferenceService in RawDeployment mode (line 28), which does not support scaling down to zero or up from zero. RawDeployment uses Kubernetes HPA, not Knative's autoscaling (KPA), and lacks the request-buffering "activator" component needed for scale-from-zero. The guide achieves metric-based autoscaling only, not scale-from-zero capability.

Either remove the scale-from-zero promise from the info box, or add a clear qualification: "KEDA can scale from zero to N instances when used with Knative mode; this guide demonstrates metric-based autoscaling with RawDeployment, which does not support scaling from zero."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/en/model_inference/inference_service/how_to/keda_autoscaling.mdx` around
lines 19 - 21, The info box claims KEDA can "scale from zero to N instances" but
the guide configures an InferenceService in RawDeployment mode which uses
Kubernetes HPA and does not support scale-from-zero; update the text in the info
box to either remove the scale-from-zero claim or qualify it by adding a line
such as "KEDA can scale from zero to N when used with Knative mode; this guide
demonstrates metric-based autoscaling with RawDeployment, which does not support
scaling from zero (RawDeployment uses HPA and lacks Knative's activator)."
Ensure you reference RawDeployment and InferenceService so readers know the
limitation applies to this configuration and maintain the existing KEDA vs
Knative distinction.

Comment on lines +35 to +36
Before proceeding, apply the following RBAC resources to allow `kserve-controller-manager` to manage KEDA objects (`ScaledObject`, `TriggerAuthentication`, etc.):


⚠️ Potential issue | 🟠 Major





🌐 Web query:

KServe KEDA integration permissions required resources

💡 Result:

KServe ↔ KEDA integration: what permissions/resources are required?

1) KServe controller RBAC (what it must be allowed to manage for KEDA)

In upstream KServe, the controller ClusterRole (kserve-manager-role) includes these KEDA-specific permissions (needed so KServe can create/manage the KEDA objects it generates when you set serving.kserve.io/autoscalerClass: "keda"):

  • API group: keda.sh
  • Resources & verbs:
    • scaledobjects, scaledobjects/finalizers: create, delete, get, list, watch, patch, update
    • scaledobjects/status: get, patch, update [1]

(That same role also grants KServe permissions it needs around the scale target and related plumbing, e.g. deployments.apps and horizontalpodautoscalers.autoscaling, but the items above are the KEDA integration–specific pieces.) [1]

2) KEDA resources that must exist in the cluster

KServe’s KEDA autoscaling requires KEDA installed (i.e., KEDA CRDs/controllers available). KServe then creates a ScaledObject per predictor/deployment when configured. [2]

3) If you scale from Prometheus/custom metrics: extra namespaced resources

When using Prometheus (or other external metric backends), you typically also create in the same namespace as the InferenceService:

  • Secret (credentials/CA/bearer token, if needed)
  • TriggerAuthentication referencing that Secret [2]

Also note: TriggerAuthentication is namespaced and (by default) can only be used by a ScaledObject in the same namespace. [3]


Sources:
[1] KServe controller RBAC (kserve-manager-role) showing KEDA ScaledObject permissions (raw.githubusercontent.com)
[2] KServe “Autoscaling with KEDA” docs (ScaledObject creation + TriggerAuthentication/Secret example) (kserve.github.io)
[3] KEDA Authentication docs (TriggerAuthentication namespace scoping) (keda.sh)



Scope KEDA permissions to minimal required resources and verbs.

Lines 45-48 use resources: ["*"] and verbs: ["*"], granting kserve-controller-manager blanket access to all KEDA resources and actions. Per upstream KServe RBAC configuration, the controller only needs the following scoped permissions:

```yaml
- apiGroups:
  - keda.sh
  resources:
  - scaledobjects
  - scaledobjects/finalizers
  - scaledobjects/status
  verbs:
  - create
  - delete
  - get
  - list
  - watch
  - patch
  - update
```

This applies to lines 42–48. Update the example to follow least privilege by default.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/en/model_inference/inference_service/how_to/keda_autoscaling.mdx` around
lines 35 - 36, The RBAC example currently gives kserve-controller-manager
blanket KEDA access via resources: ["*"] and verbs: ["*"]; replace that wildcard
rule in the Role/ClusterRole snippet that references keda.sh with a scoped rule
that lists only the required apiGroup keda.sh and resources scaledobjects,
scaledobjects/finalizers, scaledobjects/status and the verbs create, delete,
get, list, watch, patch, update so the Role assigned to
kserve-controller-manager follows least-privilege; update the YAML example where
kserve-controller-manager and KEDA permissions are declared to use this explicit
list instead of wildcards.

Comment on lines +77 to +84
### Stop the Running InferenceService

Before making changes, stop the running `InferenceService` to avoid conflicts between the existing HPA and the new KEDA-managed scaler. Add the following annotation to stop it:

```bash
kubectl annotate inferenceservice <your-isvc-name> -n <your-namespace> \
serving.kserve.io/stop='true'
```

⚠️ Potential issue | 🟠 Major



Add an explicit step to remove the stop annotation before the service can start.

The guide stops the InferenceService with serving.kserve.io/stop='true' but never removes this annotation. Without clearing it after applying the updated manifest, the service remains stopped and KEDA autoscaling cannot function. Add the following after the "Configure the InferenceService for KEDA" section and before "Verify the Setup":

Suggested addition

````diff
 The HPA output will show the current metric value, the scaling threshold, and the current/desired replica counts. As inference traffic increases, the `TARGETS` value will rise and replicas will scale up automatically.

+### Remove the Stop Annotation
+
+After applying the updated `InferenceService` manifest with KEDA configuration, remove the temporary stop annotation so the service can start and be managed by KEDA:
+
+```bash
+kubectl annotate inferenceservice <your-isvc-name> -n <your-namespace> serving.kserve.io/stop-
+```
+
 </Steps>
````
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/en/model_inference/inference_service/how_to/keda_autoscaling.mdx` around
lines 77 - 84, Add a step to remove the stop annotation so the InferenceService
can resume after KEDA config: after the "Configure the InferenceService for
KEDA" section and before "Verify the Setup", instruct the user to remove the
serving.kserve.io/stop annotation (i.e., kubectl annotate inferenceservice
<your-isvc-name> -n <your-namespace> serving.kserve.io/stop-) so the
InferenceService can start and KEDA autoscaling will take effect; reference the
annotation key serving.kserve.io/stop and the sections "Configure the
InferenceService for KEDA" and "Verify the Setup" to show where to add this
step.
