fix(cloudsupport): make AKS cloud-describe with timeout #149

Merged
matthyx merged 4 commits into kubescape:main from
yugal07:fix/cloud-aks-descript
Apr 29, 2026
Contributor

@yugal07 yugal07 commented Apr 28, 2026

Summary

Fixes the AKS air-gap path where cloudsupport.GetDescriptiveInfoFromCloudProvider
hangs indefinitely on azidentity.NewDefaultAzureCredential and unbounded ARM
calls, starving the kubescape scan loop. Resolves the root cause behind
kubescape/helm-charts#637.

The cloud-describe step is now best-effort:

  • All Azure credential and ARM calls are bounded by a 5-second context timeout.
  • A new sentinel ErrCloudDescribeUnavailable is returned when describe fails
    for non-fatal reasons (offline mode, missing creds, network unreachable).
  • A KS_OFFLINE=true short-circuit skips cloud-describe entirely. This env var
    is already set by the Helm chart when capabilities.kubescapeOffline=enable.

Callers in kubescape recognize the sentinel via errors.Is and continue the
scan instead of aborting, so host-scanner / node-agent data still gets
collected and persisted.

Why this matters

In an AKS air-gapped cluster:

  1. cloudsupport.GetCloudProvider detects AKS via the node's azure:// providerID.
  2. The describe path calls azidentity.NewDefaultAzureCredential(nil) which
    probes IMDS at 169.254.169.254 and the AAD token endpoint.
  3. With egress to those endpoints blocked, every credential source in the
    chain hits its own connect timeout sequentially, then the ARM Get call
    hangs against a non-routable ARM endpoint.
  4. The total stall is on the order of minutes per scan iteration, which
    prevents the host-scanner reporting pipeline from completing in time.
    The symptom users see is missing cniinfo-*, kubeletinfo-*,
    kubeproxyinfo-*, and controlplaneinfo-* workloadconfigurationscan CRDs.

Changes

cloudsupport/v1/akssupport.go

  • New aksCallTimeout = 5 * time.Second constant.
  • GetClusterDescribe, ListAllRolesForScope, ListAllRoleDefinitions now
    derive a deadline-bound ctx via context.WithTimeout and pass it into
    the Azure SDK call. Failures are bounded; no behavioural change when the
    control plane is reachable.

cloudsupport/cloudproviderconfiguration.go

  • New ErrCloudDescribeUnavailable sentinel error.
  • New KS_OFFLINE_ENV_VAR constant + cloudDescribeDisabled() helper.
  • All four entry points (GetDescriptiveInfoFromCloudProvider,
    GetDescribeRepositoriesFromCloudProvider,
    GetListEntitiesForPoliciesFromCloudProvider,
    GetPolicyVersionFromCloudProvider) short-circuit at the top with the
    sentinel when KS_OFFLINE=true.
  • AKS-branch errors (subscription/resource-group missing, ARM call failed)
    are wrapped with the sentinel so callers can errors.Is and treat them
    as non-fatal.

cloudsupport/cloudproviderconfiguration_test.go (new)

  • TestKSOfflineShortCircuitsCloudDescribe — asserts every entry point honors
    KS_OFFLINE=true and returns ErrCloudDescribeUnavailable.
  • TestAKSMissingCredsWrapsSentinel — asserts the AKS error path returns
    the sentinel even without KS_OFFLINE, which is the realistic state for
    air-gapped users with no Azure creds configured.

Test plan

Unit

  • go test ./cloudsupport/... — all green, incl. existing AKS describe tests.
  • go vet ./cloudsupport/... — clean.

End-to-end (kind, simulated AKS air-gap)

  • 3-node kind cluster with azure://... providerID baked into kubelet
    config per node, so cloudsupport.GetCloudProvider returns AKS.
  • iptables OUTPUT -d 169.254.169.254 -j DROP on every node to simulate
    IMDS unreachability the AKS-style way (drop, not reject).
  • kubescape installed via Helm with capabilities.kubescapeOffline=enable,
    using a custom image built from this PR + a corresponding kubescape
    consumer-side patch.
  • Triggered scan completed in 49s (vs. minutes-long stall on stock build).
  • The "Downloading cloud resources" / "Downloaded cloud resources" step
    finished within the same second; no DefaultAzureCredential traces in the log.
  • Scan report carries cloud-describe-unavailable next to AKS-specific
    controls (CIS-AKS, "Manage Kubernetes RBAC users with Azure AD", etc.) —
    the sentinel propagates into user-facing output.
  • All 11 frameworks scored normally (security, AllControls, ArmoBest,
    cis-aks-t1.2.0, cis-aks-t1.8.0, cis-v1.10.0, cis-v1.12.0, DevOpsBest,
    MITRE, NSA, SOC2) — proves AKS detection still happens and the rest of
    the scan pipeline is unaffected.

Risk / compatibility

  • Public API surface unchanged. IAKSSupport interface signatures preserved,
    so existing mocks (AKSSupportMock) and consumers compile without changes.
  • New exported symbols: ErrCloudDescribeUnavailable, KS_OFFLINE_ENV_VAR.
    Both additive.
  • Older consumers that don't recognize ErrCloudDescribeUnavailable see a
    wrapped error and fall through to whatever their existing fallback path is —
    no regression vs. today.
  • The 5s timeout is a hard bound on the happy path. Real-AKS users with a
    reachable control plane should see no behavioural change; if anyone
    regularly relies on cloud-describe taking >5s, they'd notice — but that
    would already be a symptom of an unhealthy AKS connection.

Follow-ups (separate PRs)

  • kubescape: bump k8s-interface, update
    core/pkg/resourcehandler/k8sresources.go to recognize the sentinel and
    log + continue. (Branch ready.) PR for that: fix(resourcehandler): recognize ErrCloudDescribeUnavailable as non-fatal, kubescape#2003
  • kubescape/operator: bump k8s-interface go.mod (no code change needed).
  • Optional: apply the same context.WithTimeout pattern to
    cloudsupport/v1/ekssupport.go and gkesupport.go for symmetry. AWS/GCP
    SDKs fail faster in practice, so this is a robustness nice-to-have rather
    than a fix for a reported bug.

Summary by CodeRabbit

  • New Features

    • Offline mode lets cloud provider describe operations be fully skipped to avoid unnecessary calls.
    • Added timeouts for Azure interactions to prevent hanging requests.
  • Bug Fixes

    • Improved error signaling so callers can reliably detect when cloud-describe operations are unavailable or fail due to air-gapped/missing-credentials scenarios.

Signed-off-by: yugal07 <yashsadhwani544@gmail.com>
coderabbitai Bot commented Apr 28, 2026

📥 Commits

Reviewing files that changed from the base of the PR and between ab5c59c and 7b47e45.

📒 Files selected for processing (1)
  • cloudsupport/cloudproviderconfiguration_test.go
📝 Walkthrough

Walkthrough

Add KS_OFFLINE short-circuiting and exported sentinel ErrCloudDescribeUnavailable to cloud-describe entry points; AKS-specific errors are wrapped with that sentinel. Apply operation timeouts to Azure ARM calls in AKS support functions and add unit tests for offline and air-gapped scenarios.

Changes

Cohort / File(s) | Summary
Cloud describe control & sentinel (cloudsupport/cloudproviderconfiguration.go) | Add exported ErrCloudDescribeUnavailable and cloudDescribeDisabled() checks; Get*FromCloudProvider entry points return wrapped ErrCloudDescribeUnavailable when offline or AKS describe failures occur.
Tests for offline / air-gapped behavior (cloudsupport/cloudproviderconfiguration_test.go) | Add unit tests that set/unset KS_OFFLINE and Azure env vars to assert callers receive ErrCloudDescribeUnavailable via errors.Is.
AKS Azure SDK timeouts (cloudsupport/v1/akssupport.go) | Introduce aksCallTimeout and aksRBACEnumerationTimeout constants and use context.WithTimeout for Azure ARM interactions in GetClusterDescribe, ListAllRolesForScope, and ListAllRoleDefinitions to avoid indefinite blocking.

Sequence Diagram(s)

sequenceDiagram
  participant Caller as Caller
  participant CP as CloudProvider (local code)
  participant Azure as Azure ARM SDK

  Caller->>CP: Invoke Get*FromCloudProvider(...)
  alt KS_OFFLINE=true
    CP->>Caller: return ErrCloudDescribeUnavailable
  else KS_OFFLINE=false
    CP->>CP: create ctx with timeout
    CP->>Azure: SDK call with ctx
    alt SDK success
      Azure-->>CP: result
      CP-->>Caller: return result
    else SDK error/unreachable
      Azure-->>CP: error
      CP->>CP: wrap error with ErrCloudDescribeUnavailable
      CP-->>Caller: return wrapped error
    end
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 I sniffed the cloud and found a gate,

KS_OFFLINE said "skip — wait!"
Timeouts set to keep things spry,
Errors wrapped so callers spy,
Tests hop in — the hop's complete.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 77.78% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title check | ✅ Passed | The title accurately summarizes the main change: adding timeout handling to AKS cloud-describe operations to prevent indefinite blocking in offline/air-gap scenarios.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cloudsupport/v1/akssupport.go (1)

136-160: ⚠️ Potential issue | 🟠 Major

Share one timeout across the full role-definition flow.

ListAllRoleDefinitions creates a fresh 5-second deadline and then calls ListAllRolesForScope at line 144, which creates another independent 5-second context at line 102. The list operation consumes time from its own budget, leaving only the remaining time in the outer deadline for the subsequent GetByID calls in the loop at line 154. If the listing phase consumes most of the outer 5-second window, the definition-fetch phase may timeout prematurely. Pass the context through ListAllRolesForScope instead of recreating it, so the entire operation shares a single deadline.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cloudsupport/v1/akssupport.go` around lines 136 - 160, ListAllRoleDefinitions
currently creates a context (ctx) with aksCallTimeout and then calls
ListAllRolesForScope which itself creates a new independent timeout; change
ListAllRolesForScope to accept a context parameter (e.g., ctx context.Context)
and have ListAllRoleDefinitions pass its ctx into ListAllRolesForScope so the
listing and subsequent GetByID calls share the same deadline; update the
ListAllRolesForScope implementation to stop creating its own context and use the
passed ctx for all Azure calls (and adjust any callers accordingly) so that
GetByID iterations use the same timeout window.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cloudsupport/cloudproviderconfiguration_test.go`:
- Around line 48-50: This test mutates process-global env vars
(KS_OFFLINE_ENV_VAR, cloudsupportv1.AZURE_SUBSCRIPTION_ID_ENV_VAR,
cloudsupportv1.AZURE_RESOURCE_GROUP_ENV_VAR) with os.Unsetenv; capture each
original value via os.LookupEnv before unsetting and register a t.Cleanup that
restores the original state (call os.Setenv if the var existed, or os.Unsetenv
if it did not). Replace the raw os.Unsetenv calls with this pattern (or use
t.Setenv to temporarily set values) so the cleanup restores environment for
subsequent tests.


📥 Commits

Reviewing files that changed from the base of the PR and between 262e4be and dd8caf4.

📒 Files selected for processing (3)
  • cloudsupport/cloudproviderconfiguration.go
  • cloudsupport/cloudproviderconfiguration_test.go
  • cloudsupport/v1/akssupport.go

Comment thread cloudsupport/cloudproviderconfiguration_test.go Outdated
…ds test - Coderabbit

Signed-off-by: yugal07 <yashsadhwani544@gmail.com>
Comment thread cloudsupport/v1/akssupport.go Outdated
Signed-off-by: yugal07 <yashsadhwani544@gmail.com>

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cloudsupport/cloudproviderconfiguration_test.go`:
- Around line 51-56: The t.Cleanup closure currently ignores errors from
os.Setenv and os.Unsetenv; update the closure in the test so both calls capture
their returned error and handle it like the earlier env setup code (e.g., if err
!= nil { t.Fatalf("...: %v", err) }). Locate the t.Cleanup anonymous func that
references had, key, and prev and wrap os.Setenv(key, prev) and os.Unsetenv(key)
with error checks that call t.Fatalf (or t.Fatalf-style reporting) including the
err values to satisfy errcheck and match the pattern used elsewhere in the test.
📥 Commits

Reviewing files that changed from the base of the PR and between dd8caf4 and ab5c59c.

📒 Files selected for processing (2)
  • cloudsupport/cloudproviderconfiguration_test.go
  • cloudsupport/v1/akssupport.go

Comment thread cloudsupport/cloudproviderconfiguration_test.go
Signed-off-by: yugal07 <yashsadhwani544@gmail.com>
Contributor

@matthyx matthyx left a comment

thanks @yugal07 !

@matthyx matthyx merged commit 35d1027 into kubescape:main Apr 29, 2026
9 of 10 checks passed
Contributor Author

yugal07 commented Apr 29, 2026

Thanks @matthyx. This is the kubescape-side PR to handle this:
kubescape/kubescape#2003
Thanks.

Contributor

matthyx commented Apr 29, 2026

yes, please bump the version there
