RHEcosystemAppEng/nvpe-416-workaround

NVPE-416 Workaround: NIM ConfigMap Size Limit

Problem

The NIM Account ConfigMap (odh-nim-account-cm) exceeds Kubernetes' 1MB etcd limit once the number of NVIDIA NIM models grows beyond ~180. The controller stores the full API response for each model (~6KB each), which brings the ConfigMap to ~1.1MB.
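The arithmetic behind the ~180-model threshold can be checked with a few lines of shell. This is only a rough sketch: the per-model size and model count are the approximate figures quoted above, and the function names are made up for illustration.

```shell
#!/usr/bin/env bash
# Rough size estimate: number of models x approximate KB per model.
# Both inputs are the approximations quoted in this README.
estimate_configmap_kb() {
  local models=$1 kb_per_model=$2
  echo $(( models * kb_per_model ))
}

# The 1MB limit expressed in KB (1 MiB = 1024 KiB).
LIMIT_KB=1024

# Succeeds (exit status 0) when the given size in KB is over the limit.
exceeds_limit() {
  local size_kb=$1
  [ "$size_kb" -gt "$LIMIT_KB" ]
}

size_kb=$(estimate_configmap_kb 180 6)
echo "estimated size: ${size_kb} KB"
if exceeds_limit "$size_kb"; then
  echo "over the ${LIMIT_KB} KB limit"
fi
```

With 180 models at ~6KB each, the estimate is 1080 KB, which is just past the limit.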

Symptoms:

  • NIM enablement via Dashboard succeeds (API key validation passes)
  • ConfigMap creation fails with etcd size limit error
  • Account status shows condition ConfigMapUpdate with status False
  • NIM models are not available in the Dashboard
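To confirm the ConfigMapUpdate symptom from the command line, a small jq helper along these lines may be useful. It is a sketch that assumes the conditions live under .status.conditions, as in the verification command later in this README. The function name is hypothetical.

```shell
# Print the status ("True", "False", ...) of one condition type from an
# Account object supplied as JSON on stdin. Requires jq.
condition_status() {
  local type=$1
  jq -r --arg type "$type" \
    '.status.conditions[]? | select(.type == $type) | .status'
}

# Usage against a live cluster (resource name and namespace are the
# defaults mentioned in this README):
#   kubectl get account.nim.opendatahub.io odh-nim-account \
#     -n redhat-ods-applications -o json | condition_status ConfigMapUpdate
```

On an affected cluster this should print False for ConfigMapUpdate.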

Affected versions:

  • RHOAI 2.25.x prior to 2.25.3 (scheduled for 2025-03-02)
  • RHOAI 3.x prior to 3.4 (scheduled for 2025-05-14)

Use this workaround until upgrading to a fixed version.

Jira:

Solution

A permanent fix is available in:

  • Upstream PR: #692 (incubating), #690 (main)
  • Downstream PR: #1414 (rhoai-2.25)

The fix re-marshals only the required model fields, reducing ConfigMap size from ~1.1MB to ~180KB.

Workaround Script

For clusters running affected versions before the fix is released, use the nvpe416_workaround.sh script.

Prerequisites

  1. NIM integration must be enabled via the RHOAI Dashboard
  2. API key validation must have SUCCEEDED (check Account status)
  3. ConfigMap creation must have FAILED (the oversized error)
  4. Use the SAME API key that was used for enablement
  5. Only PERSONAL API keys (nvapi-*) are supported - NOT legacy keys
  6. oc/kubectl must be logged into the cluster with admin privileges
  7. jq must be installed on the machine running the script
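Some of these prerequisites can be checked locally before running the script. The helpers below are an illustrative sketch, not part of nvpe416_workaround.sh. They only check the key prefix and tool availability, not cluster login or Account state.

```shell
# Return 0 if the key looks like a personal API key (nvapi- prefix).
is_personal_key() {
  case $1 in
    nvapi-?*) return 0 ;;
    *)        return 1 ;;
  esac
}

# Print the name of each listed tool that is missing from PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Example:
#   is_personal_key "$API_KEY" || echo "not a personal (nvapi-*) key"
#   missing_tools jq kubectl
```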

Usage

# Basic usage (uses defaults: redhat-ods-applications/odh-nim-account)
./nvpe416_workaround.sh nvapi-xxxxxxxxxxxxxxxxxxxx

# With custom namespace/account
./nvpe416_workaround.sh --namespace my-namespace --account my-nim-account nvapi-xxxxxxxxxxxxxxxxxxxx

# Dry run (preview changes without applying)
./nvpe416_workaround.sh --dry-run nvapi-xxxxxxxxxxxxxxxxxxxx

# Custom throttle duration (default: 720h = 30 days)
./nvpe416_workaround.sh --throttle 2160h nvapi-xxxxxxxxxxxxxxxxxxxx

Understanding the Throttle

The --throttle option sets how long the controller skips re-validation and ConfigMap refresh during reconciliation. The controller is triggered by Kubernetes reconciliation events (at least twice daily). The throttle only applies after a successful operation:

  • If the previous validation/ConfigMap refresh succeeded and the throttle duration has not passed, the controller skips the operation during reconciliation.
  • If the previous attempt failed, the throttle is ignored and the operation runs on the next reconciliation.
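The two rules above can be sketched as a single decision function. This is an illustration of the described behavior only, not the controller's actual code. The function name and its epoch-seconds inputs are assumptions.

```shell
# Decide whether reconciliation should skip the operation.
#   $1: "true" if the previous attempt succeeded, anything else otherwise
#   $2: epoch seconds of the last successful attempt
#   $3: throttle duration in seconds
#   $4: current time in epoch seconds
# Returns 0 (skip) only when the last attempt succeeded and the
# throttle window has not yet elapsed.
should_skip() {
  local prev_ok=$1 last_success=$2 throttle=$3 now=$4
  [ "$prev_ok" = "true" ] || return 1
  [ $(( now - last_success )) -lt "$throttle" ]
}

# Example: 720h throttle, last success one day ago.
#   should_skip true "$(( $(date +%s) - 86400 ))" "$(( 720 * 3600 ))" "$(date +%s)" \
#     && echo "skip" || echo "run"
```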

While the throttle is active:

  1. The API key is not re-validated with NVIDIA's API. If the key expires on NVIDIA's end during this period, model deployments will fail.
  2. The ConfigMap is not refreshed, so new models added by NVIDIA won't appear.

Default throttle: 720h (30 days). Set it long enough to allow time for upgrading, but not so long that key expiration becomes a concern.
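When choosing a value, it can help to convert hours to days. The helper below is a minimal sketch that only handles the whole-hour form used in this README (for example 720h). Whether --throttle accepts other duration formats is not covered here.

```shell
# Convert a duration given in whole hours (e.g. "720h") to whole days.
hours_to_days() {
  local hours=${1%h}   # strip the trailing "h"
  echo $(( hours / 24 ))
}

hours_to_days 720h    # prints 30 (the default throttle)
hours_to_days 2160h   # prints 90 (the value from the usage example)
```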

Important: If the cluster is not upgraded to a fixed RHOAI version before the throttle expires, the controller will attempt to refresh the ConfigMap on the next reconciliation and fail due to the size limit, and the issue will return.

What the Script Does

  1. Phase 1: Validation & Data Gathering (~1-2 minutes, no downtime)

    • Validates the API key
    • Verifies the Account exists and validation has succeeded
    • Fetches NIM model metadata from NVIDIA API
    • Extracts only required fields (matches the permanent fix)
  2. Phase 2: Scale Down & Apply (brief downtime)

    • Sets OLM subscription to Manual (prevents operator interference)
    • Scales down rhods-operator and odh-model-controller
    • Deletes the validating webhook temporarily
    • Creates the trimmed ConfigMap with proper owner reference
    • Patches Account status with merged conditions
  3. Phase 3: Scale Up & Restore

    • Scales controller back up (recreates webhook automatically)
    • Scales operator back up
    • Restores OLM subscription to original state
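The scale-down and scale-up steps above can be sketched as follows. This is a hedged outline, not the script's actual implementation. It omits the webhook deletion, ConfigMap creation, and status patch. The operator namespace and subscription name are assumptions, and all function names are hypothetical. By default it only prints the commands it would run.

```shell
# Dry-run sketch: prints the commands instead of running them unless
# DRY_RUN=0 is set. Values marked "assumed" are not confirmed by this README.
DRY_RUN=${DRY_RUN:-1}
APP_NS=${APP_NS:-redhat-ods-applications}        # default namespace from Usage
OPERATOR_NS=${OPERATOR_NS:-redhat-ods-operator}  # assumed
SUBSCRIPTION=${SUBSCRIPTION:-rhods-operator}     # assumed

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

scale_down() {
  # Manual approval keeps OLM from upgrading the operator mid-change.
  run oc patch subscription.operators.coreos.com "$SUBSCRIPTION" -n "$OPERATOR_NS" \
    --type merge -p '{"spec":{"installPlanApproval":"Manual"}}'
  run oc scale deployment/rhods-operator -n "$OPERATOR_NS" --replicas=0
  run oc scale deployment/odh-model-controller -n "$APP_NS" --replicas=0
}

scale_up() {
  # The real script restores the original state; this sketch does not
  # record it, so replica count and approval mode are passed in.
  local replicas=${1:-1} approval=${2:-Automatic}
  run oc scale deployment/odh-model-controller -n "$APP_NS" --replicas="$replicas"
  run oc scale deployment/rhods-operator -n "$OPERATOR_NS" --replicas="$replicas"
  run oc patch subscription.operators.coreos.com "$SUBSCRIPTION" -n "$OPERATOR_NS" \
    --type merge -p "{\"spec\":{\"installPlanApproval\":\"$approval\"}}"
}
```

Calling scale_down followed by scale_up with DRY_RUN left at its default prints the six commands in order, which makes it easy to review them before touching a cluster.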

Expected Outcome

After running the script:

| Check | Expected |
| --- | --- |
| ConfigMap size | ~150-200 KB (well under 1MB) |
| APIKeyValidation | True |
| ConfigMapUpdate | True |
| TemplateUpdate | True |
| SecretUpdate | True |
| AccountStatus | True |

Verify with:

kubectl get account.nim.opendatahub.io odh-nim-account -n redhat-ods-applications -o json | jq '{
  conditions: [.status.conditions[] | {type, status}],
  nimConfig: .status.nimConfig.name
}'
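To reduce that output to a pass/fail signal, a jq filter can list only the conditions that are not True. This is a sketch under the same assumption about .status.conditions. The function name is hypothetical.

```shell
# Print the type of every condition whose status is not "True".
# Reads an Account object as JSON on stdin; prints nothing when all pass.
failing_conditions() {
  jq -r '.status.conditions[]? | select(.status != "True") | .type'
}

# Usage:
#   kubectl get account.nim.opendatahub.io odh-nim-account \
#     -n redhat-ods-applications -o json | failing_conditions
```

Empty output means every reported condition is True.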

Troubleshooting

Script fails with "Account not found"

Ensure NIM is enabled via the Dashboard first. The Account is created when you enable NIM integration.

Script fails with "API key validation has not succeeded"

Wait for the Dashboard enablement to complete validation, or check if the API key is valid.

ConfigMap still fails after script

Check the ConfigMap size. If it's still over 1MB, there may be an issue with the data fetching. Try running with --dry-run to inspect the generated ConfigMap.
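One way to check the size is to count the bytes of the serialized object. The sketch below treats the limit as 1048576 bytes and measures the full JSON output, which includes metadata and so slightly overstates the stored size. The function name is hypothetical, and the ConfigMap name and namespace are the defaults from this README.

```shell
# Report the byte count of whatever is piped in and whether it fits
# under the 1MB limit (taken here as 1048576 bytes).
LIMIT_BYTES=1048576

check_size() {
  local bytes
  bytes=$(wc -c | tr -d '[:space:]')
  if [ "$bytes" -lt "$LIMIT_BYTES" ]; then
    echo "OK: ${bytes} bytes"
  else
    echo "TOO LARGE: ${bytes} bytes"
    return 1
  fi
}

# Usage:
#   kubectl get configmap odh-nim-account-cm -n redhat-ods-applications -o json | check_size
```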

Controller overwrites ConfigMap after script

The script sets throttles to prevent this. If it happens:

  1. Check that nimConfigRefreshRate and validationRefreshRate are set in the Account spec
  2. Verify lastSuccessfulConfigRefresh is after lastSuccessfulValidation in status
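The timestamp comparison in step 2 can be done in plain shell. This sketch assumes both values are RFC 3339 UTC strings in the same format, so byte-wise ordering matches time ordering. The exact field paths in the usage comment follow the description above but are not otherwise confirmed.

```shell
# Return 0 if timestamp $1 sorts strictly after timestamp $2.
# Assumes both are RFC 3339 UTC strings in the same format
# (e.g. 2025-01-01T00:00:00Z), so byte-wise order matches time order.
is_after() {
  [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | LC_ALL=C sort | tail -n 1)" = "$1" ]
}

# Usage (field paths assumed from the description above):
#   acct=$(kubectl get account.nim.opendatahub.io odh-nim-account \
#     -n redhat-ods-applications -o json)
#   echo "$acct" | jq '.spec | {nimConfigRefreshRate, validationRefreshRate}'
#   refresh=$(echo "$acct" | jq -r '.status.lastSuccessfulConfigRefresh')
#   validation=$(echo "$acct" | jq -r '.status.lastSuccessfulValidation')
#   is_after "$refresh" "$validation" && echo "ordering OK"
```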

Webhook errors

The script temporarily deletes the validating webhook. It's recreated when the controller starts. If you see webhook errors, wait for the controller to fully start.
