The NIM Account ConfigMap (odh-nim-account-cm) exceeds Kubernetes' 1MB etcd limit once the number of NVIDIA NIM models grows beyond ~180. The controller stores the full API response for each model (~6KB each), which brings the ConfigMap to ~1.1MB.
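The arithmetic behind the overflow can be checked quickly. The sketch below assumes 6 KiB per model (the approximate figure above) and a 1 MiB (1,048,576-byte) limit; the function names are illustrative and not part of the workaround script.

```shell
# Rough estimate only: real per-model entries vary in size.
CONFIGMAP_LIMIT_BYTES=1048576   # 1 MiB, assumed to be the effective limit

# estimate_configmap_bytes MODEL_COUNT BYTES_PER_MODEL
estimate_configmap_bytes() {
  echo $(( $1 * $2 ))
}

# exceeds_limit BYTES -> exit status 0 when BYTES is over the limit
exceeds_limit() {
  [ "$1" -gt "$CONFIGMAP_LIMIT_BYTES" ]
}

size=$(estimate_configmap_bytes 180 6144)
if exceeds_limit "$size"; then
  echo "180 models at ~6 KiB each: ${size} bytes, over the limit"
fi
```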
Symptoms:
- NIM enablement via Dashboard succeeds (API key validation passes)
- ConfigMap creation fails with etcd size limit error
- Account status shows condition `ConfigMapUpdate` with status `False`
- NIM models are not available in the Dashboard
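One way to confirm these symptoms from the command line is to read the two relevant conditions off the Account resource. This is a minimal sketch: the namespace and account name are the script's documented defaults, the condition types are the ones listed in the verification table, and the helper names are hypothetical.

```shell
# Defaults taken from the workaround script's documented defaults.
NAMESPACE="${NAMESPACE:-redhat-ods-applications}"
ACCOUNT="${ACCOUNT:-odh-nim-account}"

# condition_status TYPE -> prints the status of one Account condition
condition_status() {
  kubectl get account.nim.opendatahub.io "$ACCOUNT" -n "$NAMESPACE" \
    -o jsonpath="{.status.conditions[?(@.type==\"$1\")].status}"
}

# matches_symptoms -> exit 0 when key validation passed but the
# ConfigMap update failed, which is the pattern described above
matches_symptoms() {
  [ "$(condition_status APIKeyValidation)" = "True" ] &&
    [ "$(condition_status ConfigMapUpdate)" = "False" ]
}
```

Running `matches_symptoms && echo "cluster matches this issue"` against an affected cluster should print the message; any other combination of conditions points to a different problem.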
Affected versions:
- RHOAI 2.25.x prior to 2.25.3 (scheduled for 2025-03-02)
- RHOAI 3.x prior to 3.4 (scheduled for 2025-05-14)
Use this workaround until upgrading to a fixed version.
Jira:
A permanent fix is available in:
The fix re-marshals only the required model fields, reducing ConfigMap size from ~1.1MB to ~180KB.
For clusters running affected versions before the fix is released, use the nvpe416_workaround.sh script.
- NIM integration must be enabled via the RHOAI Dashboard
- API key validation must have SUCCEEDED (check Account status)
- ConfigMap creation must have FAILED (the oversized error)
- Use the SAME API key that was used for enablement
- Only PERSONAL API keys (`nvapi-*`) are supported - NOT legacy keys
- `oc`/`kubectl` must be logged into the cluster with admin privileges
- `jq` must be installed on the machine running the script
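Several of these prerequisites can be checked before running the script. The helpers below are a hypothetical preflight sketch, not part of `nvpe416_workaround.sh`; the `kubectl auth can-i` call is only a rough proxy for "admin privileges".

```shell
# is_personal_key KEY -> exit 0 when KEY carries the personal-key prefix
is_personal_key() {
  case "$1" in
    nvapi-?*) return 0 ;;
    *) return 1 ;;
  esac
}

# have_tools -> exit 0 when jq and at least one of oc/kubectl are on PATH
have_tools() {
  command -v jq >/dev/null 2>&1 || return 1
  command -v oc >/dev/null 2>&1 || command -v kubectl >/dev/null 2>&1
}

# can_admin -> asks the cluster whether the current user may perform
# any verb on any resource (heuristic for cluster-admin access)
can_admin() {
  kubectl auth can-i '*' '*' --all-namespaces >/dev/null 2>&1
}
```

For example, `is_personal_key "$API_KEY" && have_tools && can_admin || echo "prerequisites not met"` gives a quick go/no-go, where `API_KEY` is a variable you set yourself.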
# Basic usage (uses defaults: redhat-ods-applications/odh-nim-account)
./nvpe416_workaround.sh nvapi-xxxxxxxxxxxxxxxxxxxx
# With custom namespace/account
./nvpe416_workaround.sh --namespace my-namespace --account my-nim-account nvapi-xxxxxxxxxxxxxxxxxxxx
# Dry run (preview changes without applying)
./nvpe416_workaround.sh --dry-run nvapi-xxxxxxxxxxxxxxxxxxxx
# Custom throttle duration (default: 720h = 30 days)
./nvpe416_workaround.sh --throttle 2160h nvapi-xxxxxxxxxxxxxxxxxxxx

The --throttle option sets how long the controller skips re-validation and ConfigMap refresh during reconciliation. The controller is triggered by Kubernetes reconciliation events (at least twice daily). The throttle only applies after a successful operation:
- If the previous validation/ConfigMap refresh succeeded and the throttle duration has not passed, the controller skips the operation during reconciliation.
- If the previous attempt failed, the throttle is ignored and the operation runs on the next reconciliation.
While the throttle is active:
- The API key is not re-validated with NVIDIA's API. If the key expires on NVIDIA's end during this period, model deployments will fail.
- The ConfigMap is not refreshed, so new models added by NVIDIA won't appear.
Default throttle: 720h (30 days). Set it long enough to allow time for upgrading, but not so long that key expiration becomes a concern.
Important: If the cluster is not upgraded to a fixed RHOAI version before the throttle expires, the controller will attempt to refresh the ConfigMap on the next reconciliation, fail due to the size limit, and the issue will return.
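The throttle rules above reduce to a small decision. The sketch below models them for illustration only; it is not the controller's code, and the `success`/`failure` labels and function names are assumptions.

```shell
# hours_to_days DURATION -> converts an "<N>h" duration to whole days
hours_to_days() {
  hours=${1%h}
  echo $(( $hours / 24 ))
}

# should_skip LAST_RESULT ELAPSED_HOURS THROTTLE_HOURS
# Exit 0 (skip the operation) only when the last attempt succeeded and
# the throttle window has not yet elapsed. A failed attempt is always
# retried on the next reconciliation.
should_skip() {
  [ "$1" = "success" ] && [ "$2" -lt "$3" ]
}
```

With these, `hours_to_days 720h` gives 30 and `hours_to_days 2160h` gives 90, which is a convenient way to compare a chosen `--throttle` value against your upgrade schedule.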
- Phase 1: Validation & Data Gathering (~1-2 minutes, no downtime)
  - Validates the API key
  - Verifies the Account exists and validation has succeeded
  - Fetches NIM model metadata from NVIDIA API
  - Extracts only required fields (matches the permanent fix)
- Phase 2: Scale Down & Apply (brief downtime)
  - Sets OLM subscription to Manual (prevents operator interference)
  - Scales down `rhods-operator` and `odh-model-controller`
  - Deletes the validating webhook temporarily
  - Creates the trimmed ConfigMap with proper owner reference
  - Patches Account status with merged conditions
- Phase 3: Scale Up & Restore
  - Scales controller back up (recreates webhook automatically)
  - Scales operator back up
  - Restores OLM subscription to original state
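The ordering of the scale-down and scale-up steps in Phases 2 and 3 can be sketched as follows. This is not the script's implementation: the operator namespace, the subscription name, and the replica counts are assumptions to verify on your cluster, and the webhook deletion, ConfigMap creation, and status patch are deliberately left out.

```shell
# Assumed values; check them on your cluster before using anything like this.
OPERATOR_NS="${OPERATOR_NS:-redhat-ods-operator}"
APPS_NS="${APPS_NS:-redhat-ods-applications}"
SUBSCRIPTION="${SUBSCRIPTION:-rhods-operator}"

# set_approval Manual|Automatic -> sets the OLM install plan approval mode
set_approval() {
  kubectl patch subscription.operators.coreos.com "$SUBSCRIPTION" \
    -n "$OPERATOR_NS" --type merge \
    -p "{\"spec\":{\"installPlanApproval\":\"$1\"}}"
}

# scale NAMESPACE DEPLOYMENT REPLICAS
scale() {
  kubectl scale deployment "$2" -n "$1" --replicas="$3"
}

# Phase 2 ordering: freeze OLM first, then stop operator and controller.
scale_down() {
  set_approval Manual
  scale "$OPERATOR_NS" rhods-operator 0
  scale "$APPS_NS" odh-model-controller 0
}

# Phase 3 ordering: controller, then operator, then the subscription.
# scale_up ORIGINAL_APPROVAL CONTROLLER_REPLICAS OPERATOR_REPLICAS
scale_up() {
  scale "$APPS_NS" odh-model-controller "$2"
  scale "$OPERATOR_NS" rhods-operator "$3"
  set_approval "$1"
}
```

Freezing the subscription before scaling down matters because the operator would otherwise be free to reconcile the deployments back up while the ConfigMap is being replaced.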
After running the script:
| Check | Expected |
|---|---|
| ConfigMap size | ~150-200 KB (well under 1MB) |
| APIKeyValidation | True |
| ConfigMapUpdate | True |
| TemplateUpdate | True |
| SecretUpdate | True |
| AccountStatus | True |
Verify with:
kubectl get account.nim.opendatahub.io odh-nim-account -n redhat-ods-applications -o json | jq '{
  conditions: [.status.conditions[] | {type, status}],
  nimConfig: .status.nimConfig.name
}'

Ensure NIM is enabled via the Dashboard first. The Account is created when you enable NIM integration.
Wait for the Dashboard enablement to complete validation, or check if the API key is valid.
Check the ConfigMap size. If it's still over 1MB, there may be an issue with the data fetching. Try running with --dry-run to inspect the generated ConfigMap.
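A quick way to measure the ConfigMap is to count the bytes of its JSON representation. This is an approximation (the output includes metadata and formatting), and the 1 MiB threshold and helper names are assumptions made for this sketch.

```shell
LIMIT_BYTES=1048576   # 1 MiB, assumed to be the effective limit

# configmap_bytes NAME NAMESPACE -> approximate serialized size in bytes
configmap_bytes() {
  kubectl get configmap "$1" -n "$2" -o json | wc -c | tr -d ' '
}

# under_limit BYTES -> exit 0 when BYTES is below the limit
under_limit() {
  [ "$1" -lt "$LIMIT_BYTES" ]
}
```

For the default setup this could be used as `size=$(configmap_bytes odh-nim-account-cm redhat-ods-applications); under_limit "$size" && echo "OK: $size bytes"`.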
The script sets throttles to prevent this. If it happens:
- Check that `nimConfigRefreshRate` and `validationRefreshRate` are set in the Account spec
- Verify `lastSuccessfulConfigRefresh` is after `lastSuccessfulValidation` in status
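Both checks can be scripted. The sketch below assumes the four fields sit directly under `.spec` and `.status` of the Account resource and that both timestamps use the same UTC RFC 3339 format, so plain string comparison orders them correctly; the exact JSON paths and helper names are assumptions.

```shell
NAMESPACE="${NAMESPACE:-redhat-ods-applications}"
ACCOUNT="${ACCOUNT:-odh-nim-account}"

# account_field JSONPATH -> prints one field of the Account resource
account_field() {
  kubectl get account.nim.opendatahub.io "$ACCOUNT" -n "$NAMESPACE" \
    -o jsonpath="{$1}"
}

# is_after A B -> exit 0 when timestamp A sorts strictly after B.
# The "x" prefix keeps expr from treating a value as an operator.
is_after() {
  LC_ALL=C expr "x$1" \> "x$2" >/dev/null
}

# throttles_look_right -> exit 0 when both refresh rates are set and
# the last config refresh is newer than the last validation
throttles_look_right() {
  [ -n "$(account_field .spec.nimConfigRefreshRate)" ] &&
    [ -n "$(account_field .spec.validationRefreshRate)" ] &&
    is_after "$(account_field .status.lastSuccessfulConfigRefresh)" \
             "$(account_field .status.lastSuccessfulValidation)"
}
```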
The script temporarily deletes the validating webhook. It's recreated when the controller starts. If you see webhook errors, wait for the controller to fully start.