NVPE-416 Workaround: NIM ConfigMap Size Limit

Problem

The NIM Account ConfigMap (odh-nim-account-cm) exceeds Kubernetes' 1MB etcd limit once the number of NVIDIA NIM models grows beyond ~180. The controller stores the full API response for each model (~6KB each), which brings the ConfigMap to ~1.1MB.
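
As a rough check on those figures, the arithmetic can be sketched in shell. The helper name estimate_configmap_kb is made up for illustration, and the inputs are the approximate numbers quoted above, not measured values.

```shell
#!/usr/bin/env bash
# Back-of-the-envelope estimate using the approximate figures from this README.
# estimate_configmap_kb is a hypothetical helper, not part of the workaround script.

LIMIT_KB=1024   # Kubernetes caps ConfigMap data at 1 MiB

# Prints the estimated ConfigMap size in KB for a model count and per-model size.
estimate_configmap_kb() {
  local models=$1 kb_per_model=$2
  echo $(( models * kb_per_model ))
}

size_kb=$(estimate_configmap_kb 180 6)
if (( size_kb > LIMIT_KB )); then
  echo "~${size_kb} KB exceeds the ${LIMIT_KB} KB limit"
else
  echo "~${size_kb} KB fits within the ${LIMIT_KB} KB limit"
fi
```

With roughly 180 models at about 6KB each, the estimate already passes the limit, which matches the ~1.1MB figure above.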

Symptoms:

NIM enablement via Dashboard succeeds (API key validation passes)
ConfigMap creation fails with etcd size limit error
Account status shows condition ConfigMapUpdate with status False
NIM models are not available in the Dashboard
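
One way to confirm the failing condition from the command line is sketched below. It uses the default Account name and namespace (odh-nim-account in redhat-ods-applications). The function name failed_conditions is illustrative, and the filter assumes conditions are exposed under .status.conditions with type and status fields.

```shell
#!/usr/bin/env bash
# failed_conditions is a hypothetical helper, not part of the workaround script.
# It reads Account JSON on stdin and prints the type of every condition
# whose status is "False".
failed_conditions() {
  jq -r '.status.conditions[]? | select(.status == "False") | .type'
}

# Against a live cluster (requires an oc/kubectl login):
#   kubectl get account.nim.opendatahub.io odh-nim-account \
#     -n redhat-ods-applications -o json | failed_conditions
```

On an affected cluster the output would be expected to include ConfigMapUpdate.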

Affected versions:

RHOAI 2.25.x prior to 2.25.3 (scheduled for 2025-03-02)
RHOAI 3.x prior to 3.4 (scheduled for 2025-05-14)

Use this workaround until upgrading to a fixed version.

Jira:

Bug: NVPE-416
Workaround: NVPE-420

Solution

A permanent fix is available in:

Upstream PR: #692 (incubating), #690 (main)
Downstream PR: #1414 (rhoai-2.25)

The fix re-marshals only the required model fields, reducing ConfigMap size from ~1.1MB to ~180KB.

Workaround Script

For clusters running affected versions before the fix is released, use the nvpe416_workaround.sh script.

Prerequisites

NIM integration must be enabled via the RHOAI Dashboard
API key validation must have SUCCEEDED (check Account status)
ConfigMap creation must have FAILED (the oversized error)
Use the SAME API key that was used for enablement
Only PERSONAL API keys (nvapi-*) are supported - NOT legacy keys
oc/kubectl must be logged into the cluster with admin privileges
jq must be installed on the machine running the script
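
Some of these prerequisites can be checked before running the script. The sketch below is only an illustration: is_personal_key and missing_tools are hypothetical helpers, NGC_API_KEY is a placeholder variable name, and the real script may perform different or additional checks.

```shell
#!/usr/bin/env bash
# Hypothetical preflight helpers; not taken from nvpe416_workaround.sh.

# Succeeds only for personal keys, which start with "nvapi-".
is_personal_key() {
  case "$1" in
    nvapi-?*) return 0 ;;
    *)        return 1 ;;
  esac
}

# Prints the name of each listed tool that is not found on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Example use (NGC_API_KEY is a placeholder variable name):
#   is_personal_key "$NGC_API_KEY" || echo "not a personal API key" >&2
#   missing_tools jq kubectl
```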

Usage

# Basic usage (uses defaults: redhat-ods-applications/odh-nim-account)
./nvpe416_workaround.sh nvapi-xxxxxxxxxxxxxxxxxxxx

# With custom namespace/account
./nvpe416_workaround.sh --namespace my-namespace --account my-nim-account nvapi-xxxxxxxxxxxxxxxxxxxx

# Dry run (preview changes without applying)
./nvpe416_workaround.sh --dry-run nvapi-xxxxxxxxxxxxxxxxxxxx

# Custom throttle duration (default: 720h = 30 days)
./nvpe416_workaround.sh --throttle 2160h nvapi-xxxxxxxxxxxxxxxxxxxx

Understanding the Throttle

The --throttle option sets how long the controller skips re-validation and ConfigMap refresh during reconciliation. The controller is triggered by Kubernetes reconciliation events (at least twice daily). The throttle only applies after a successful operation:

If the previous validation/ConfigMap refresh succeeded and the throttle duration has not passed, the controller skips the operation during reconciliation.
If the previous attempt failed, the throttle is ignored and the operation runs on the next reconciliation.

While the throttle is active:

The API key is not re-validated with NVIDIA's API. If the key expires on NVIDIA's end during this period, model deployments will fail.
The ConfigMap is not refreshed, so new models added by NVIDIA won't appear.

Default throttle: 720h (30 days). Set it long enough to allow time for upgrading, but not so long that key expiration becomes a concern.

Important: If the cluster is not upgraded to a fixed RHOAI version before the throttle expires, the controller will attempt to refresh the ConfigMap on the next reconciliation, fail due to the size limit, and the issue will return.
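
The throttle rule described in this section can be sketched as a small decision function. This is an illustration of the rule as stated here, not the controller's actual code; should_skip and throttle_days are hypothetical names.

```shell
#!/usr/bin/env bash
# Illustration of the throttle rule; not the controller's implementation.

# should_skip PREV_SUCCEEDED LAST_SUCCESS_EPOCH THROTTLE_HOURS NOW_EPOCH
# Returns 0 when the operation would be skipped, 1 when it would run.
should_skip() {
  local prev_succeeded=$1 last_success=$2 throttle_hours=$3 now=$4
  # A failed previous attempt ignores the throttle entirely.
  [[ $prev_succeeded == true ]] || return 1
  local elapsed=$(( now - last_success ))
  (( elapsed < throttle_hours * 3600 ))
}

# Converts a throttle value such as "720h" to whole days.
throttle_days() {
  local hours=${1%h}
  echo $(( hours / 24 ))
}

throttle_days 720h    # prints 30
throttle_days 2160h   # prints 90
```

For example, should_skip true 0 720 3600 succeeds (one hour after a successful refresh, still inside the window), while should_skip false 0 720 3600 fails because the previous attempt did not succeed.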

What the Script Does

Phase 1: Validation & Data Gathering (~1-2 minutes, no downtime)
- Validates the API key
- Verifies the Account exists and validation has succeeded
- Fetches NIM model metadata from NVIDIA API
- Extracts only required fields (matches the permanent fix)

Phase 2: Scale Down & Apply (brief downtime)
- Sets OLM subscription to Manual (prevents operator interference)
- Scales down rhods-operator and odh-model-controller
- Deletes the validating webhook temporarily
- Creates the trimmed ConfigMap with proper owner reference
- Patches Account status with merged conditions

Phase 3: Scale Up & Restore
- Scales controller back up (recreates webhook automatically)
- Scales operator back up
- Restores OLM subscription to original state

Expected Outcome

After running the script:

Check	Expected
ConfigMap size	~150-200 KB (well under 1MB)
APIKeyValidation	True
ConfigMapUpdate	True
TemplateUpdate	True
SecretUpdate	True
AccountStatus	True

Verify with:

kubectl get account.nim.opendatahub.io odh-nim-account -n redhat-ods-applications -o json | jq '{
  conditions: [.status.conditions[] | {type, status}],
  nimConfig: .status.nimConfig.name
}'

Troubleshooting

Script fails with "Account not found"

Ensure NIM is enabled via the Dashboard first. The Account is created when you enable NIM integration.

Script fails with "API key validation has not succeeded"

Wait for the Dashboard enablement to complete validation, or check if the API key is valid.

ConfigMap still fails after script

Check the ConfigMap size. If it's still over 1MB, there may be an issue with the data fetching. Try running with --dry-run to inspect the generated ConfigMap.
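
A quick size check might look like the sketch below. size_verdict is a hypothetical helper, and the ConfigMap name is the one given in the Problem section. Counting the bytes of the full JSON output includes metadata, so it slightly overestimates the stored data.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compares a byte count with the 1 MiB ConfigMap limit.
LIMIT_BYTES=1048576

size_verdict() {
  local bytes=$1
  if (( bytes < LIMIT_BYTES )); then
    echo "ok: ${bytes} bytes"
  else
    echo "too large: ${bytes} bytes"
  fi
}

# Against a live cluster (requires an oc/kubectl login):
#   bytes=$(kubectl get configmap odh-nim-account-cm \
#     -n redhat-ods-applications -o json | wc -c)
#   size_verdict "$bytes"
```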

Controller overwrites ConfigMap after script

The script sets throttles to prevent this. If it happens:

Check that nimConfigRefreshRate and validationRefreshRate are set in the Account spec
Verify lastSuccessfulConfigRefresh is after lastSuccessfulValidation in status
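
Both checks can be done in one pass over the Account JSON, as in the sketch below. throttle_report is a hypothetical helper. The JSON paths (.spec.* and .status.*) are assumed from the field names above, and the timestamp comparison assumes both values are UTC strings in the same format, so that string ordering matches time ordering.

```shell
#!/usr/bin/env bash
# throttle_report is a hypothetical helper, not part of the workaround script.
# It reads Account JSON on stdin and prints the throttle-related fields,
# one key=value pair per line. Field paths are assumptions.
throttle_report() {
  jq -r '
    "nimConfigRefreshRate=\(.spec.nimConfigRefreshRate // "unset")",
    "validationRefreshRate=\(.spec.validationRefreshRate // "unset")",
    "refreshAfterValidation=\((.status.lastSuccessfulConfigRefresh // "") > (.status.lastSuccessfulValidation // ""))"
  '
}

# Against a live cluster (requires an oc/kubectl login):
#   kubectl get account.nim.opendatahub.io odh-nim-account \
#     -n redhat-ods-applications -o json | throttle_report
```

A line reading unset or refreshAfterValidation=false would point to the throttle not being in effect.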

Webhook errors

The script temporarily deletes the validating webhook. It's recreated when the controller starts. If you see webhook errors, wait for the controller to fully start.
