fix(cloudsupport): make AKS cloud-describe with timeout by yugal07 · Pull Request #149 · kubescape/k8s-interface

yugal07 · 2026-04-28T18:15:57Z

Summary

Fixes the AKS air-gap path where cloudsupport.GetDescriptiveInfoFromCloudProvider
hangs indefinitely on azidentity.NewDefaultAzureCredential and unbounded ARM
calls, starving the kubescape scan loop. Resolves the root cause behind
kubescape/helm-charts#637.

The cloud-describe step is now best-effort:

- All Azure credential and ARM calls are bounded by a 5-second context timeout.
- A new sentinel ErrCloudDescribeUnavailable is returned when describe fails
  for non-fatal reasons (offline mode, missing creds, network unreachable).
- A KS_OFFLINE=true short-circuit skips cloud-describe entirely. This env var
  is already set by the Helm chart when capabilities.kubescapeOffline=enable.

Callers in kubescape recognize the sentinel via errors.Is and continue the
scan instead of aborting, so host-scanner / node-agent data still gets
collected and persisted.
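
The caller-side behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the kubescape code: `describeCluster` and `collectResources` are hypothetical names, and the sentinel's message text is an assumption. Only the `errors.Is` check against `ErrCloudDescribeUnavailable` comes from the PR description.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrCloudDescribeUnavailable mirrors the sentinel named in this PR.
// The message text is an assumption.
var ErrCloudDescribeUnavailable = errors.New("cloud describe unavailable")

// describeCluster is a hypothetical stand-in for the library entry point.
// It always fails the way an air-gapped AKS cluster might.
func describeCluster() (string, error) {
	return "", fmt.Errorf("%w: IMDS unreachable", ErrCloudDescribeUnavailable)
}

// collectResources treats the sentinel as non-fatal and keeps going.
// Any other error still aborts the collection.
func collectResources() ([]string, error) {
	resources := []string{"host-scanner data"}

	desc, err := describeCluster()
	switch {
	case err == nil:
		resources = append(resources, desc)
	case errors.Is(err, ErrCloudDescribeUnavailable):
		// Best-effort step: log and continue with what we have.
		fmt.Println("skipping cloud describe:", err)
	default:
		return nil, err
	}
	return resources, nil
}

func main() {
	resources, err := collectResources()
	fmt.Println(resources, err)
}
```

Because the sentinel is wrapped with `%w`, the check keeps working even if intermediate layers add their own context to the error.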

Why this matters

In an AKS air-gapped cluster:

1. cloudsupport.GetCloudProvider detects AKS via the node's azure:// providerID.
2. The describe path calls azidentity.NewDefaultAzureCredential(nil), which
   probes IMDS at 169.254.169.254 and the AAD token endpoint.
3. With egress to those endpoints blocked, every credential source in the
   chain hits its own connect timeout sequentially, then the ARM Get call
   hangs against a non-routable ARM endpoint.
4. The total stall is on the order of minutes per scan iteration, which
   prevents the host-scanner reporting pipeline from completing in time.
   The symptom users see is missing cniinfo-*, kubeletinfo-*,
   kubeproxyinfo-*, and controlplaneinfo-* workloadconfigurationscan CRDs.

Changes

`cloudsupport/v1/akssupport.go`

- New aksCallTimeout = 5 * time.Second constant.
- GetClusterDescribe, ListAllRolesForScope, ListAllRoleDefinitions now
  derive a deadline-bound ctx via context.WithTimeout and pass it into
  the Azure SDK call. Failures are bounded; no behavioural change when the
  control plane is reachable.
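
For readers unfamiliar with the pattern, here is a small sketch of bounding a blocking call with `context.WithTimeout`. `slowCall` is a hypothetical stand-in for an Azure SDK call that never gets a reply, and `callTimeout` plays the role of `aksCallTimeout` (shortened here so the demo finishes quickly). None of this is the actual akssupport.go code.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// callTimeout plays the role of aksCallTimeout; shortened for the demo.
const callTimeout = 50 * time.Millisecond

// slowCall is a hypothetical stand-in for an SDK call against an
// unreachable endpoint. It returns only when its timer fires or the
// context is done, whichever comes first.
func slowCall(ctx context.Context) error {
	timer := time.NewTimer(10 * time.Second)
	defer timer.Stop()
	select {
	case <-timer.C:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// boundedCall derives a deadline-bound context and passes it to the call,
// so the caller waits at most callTimeout.
func boundedCall() error {
	ctx, cancel := context.WithTimeout(context.Background(), callTimeout)
	defer cancel()
	return slowCall(ctx)
}

// isTimeout reports whether err was caused by the deadline expiring.
func isTimeout(err error) bool {
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	start := time.Now()
	err := boundedCall()
	fmt.Println("timed out:", isTimeout(err), "elapsed under 1s:", time.Since(start) < time.Second)
}
```

The bound only helps if the callee honours the context. The sketch assumes it does, which is the usual contract for SDK methods that accept a `context.Context`.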

`cloudsupport/cloudproviderconfiguration.go`

- New ErrCloudDescribeUnavailable sentinel error.
- New KS_OFFLINE_ENV_VAR constant + cloudDescribeDisabled() helper.
- All four entry points (GetDescriptiveInfoFromCloudProvider,
  GetDescribeRepositoriesFromCloudProvider,
  GetListEntitiesForPoliciesFromCloudProvider,
  GetPolicyVersionFromCloudProvider) short-circuit at the top with the
  sentinel when KS_OFFLINE=true.
- AKS-branch errors (subscription/resource-group missing, ARM call failed)
  are wrapped with the sentinel so callers can errors.Is and treat them
  as non-fatal.
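
A rough sketch of the provider-side shape these bullets describe. The symbol names `KS_OFFLINE_ENV_VAR`, `cloudDescribeDisabled` and `ErrCloudDescribeUnavailable` come from the PR. Everything else is an assumption for illustration: the `describe` function, the `fetch` parameter, `errNoSubscription`, the error text, and the exact way the env var value is compared.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// Name taken from the PR; the string value is an assumption.
const KS_OFFLINE_ENV_VAR = "KS_OFFLINE"

// Sentinel named in the PR; the message text is an assumption.
var ErrCloudDescribeUnavailable = errors.New("cloud describe unavailable")

// errNoSubscription is a hypothetical AKS-branch failure.
var errNoSubscription = errors.New("subscription ID not set")

// cloudDescribeDisabled reports whether offline mode was requested.
// Comparing against the literal "true" is an assumption.
func cloudDescribeDisabled() bool {
	return os.Getenv(KS_OFFLINE_ENV_VAR) == "true"
}

// describe is a hypothetical entry point; fetch stands in for the AKS branch.
func describe(fetch func() (string, error)) (string, error) {
	if cloudDescribeDisabled() {
		return "", ErrCloudDescribeUnavailable // short-circuit before any cloud call
	}
	out, err := fetch()
	if err != nil {
		// Wrap with %w so callers can match the sentinel via errors.Is.
		return "", fmt.Errorf("%w: %v", ErrCloudDescribeUnavailable, err)
	}
	return out, nil
}

// isUnavailable is the check a caller would perform.
func isUnavailable(err error) bool {
	return errors.Is(err, ErrCloudDescribeUnavailable)
}

func main() {
	failing := func() (string, error) { return "", errNoSubscription }

	os.Setenv(KS_OFFLINE_ENV_VAR, "true")
	_, err := describe(failing)
	fmt.Println("offline:", isUnavailable(err))

	os.Unsetenv(KS_OFFLINE_ENV_VAR)
	_, err = describe(failing)
	fmt.Println("missing creds:", isUnavailable(err), err)
}
```

Both paths surface the same sentinel, so a caller needs only one check regardless of whether offline mode was set explicitly or the cloud was simply unreachable.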

`cloudsupport/cloudproviderconfiguration_test.go` (new)

- TestKSOfflineShortCircuitsCloudDescribe: asserts every entry point honors
  KS_OFFLINE=true and returns ErrCloudDescribeUnavailable.
- TestAKSMissingCredsWrapsSentinel: asserts the AKS error path returns
  the sentinel even without KS_OFFLINE, which is the realistic state for
  air-gapped users with no Azure creds configured.

Test plan

Unit

go test ./cloudsupport/... — all green, incl. existing AKS describe tests.
go vet ./cloudsupport/... — clean.

End-to-end (kind, simulated AKS air-gap)

- 3-node kind cluster with azure://... providerID baked into kubelet
  config per node, so cloudsupport.GetCloudProvider returns AKS.
- iptables OUTPUT -d 169.254.169.254 -j DROP on every node to simulate
  IMDS unreachability the AKS-style way (drop, not reject).
- kubescape installed via Helm with capabilities.kubescapeOffline=enable,
  using a custom image built from this PR + a corresponding kubescape
  consumer-side patch.
- Triggered scan completed in 49s (vs. minutes-long stall on stock build).
- Downloading cloud resources... → Downloaded cloud resources finished
  within the same second; no DefaultAzureCredential traces in the log.
- Scan report carries cloud-describe-unavailable next to AKS-specific
  controls (CIS-AKS, "Manage Kubernetes RBAC users with Azure AD", etc.);
  the sentinel propagates into user-facing output.
- All 11 frameworks scored normally (security, AllControls, ArmoBest,
  cis-aks-t1.2.0, cis-aks-t1.8.0, cis-v1.10.0, cis-v1.12.0, DevOpsBest,
  MITRE, NSA, SOC2). This proves AKS detection still happens and the rest of
  the scan pipeline is unaffected.

Risk / compatibility

- Existing public API surface unchanged. IAKSSupport interface signatures
  preserved, so existing mocks (AKSSupportMock) and consumers compile
  without changes.
- New exported symbols: ErrCloudDescribeUnavailable, KS_OFFLINE_ENV_VAR.
  Both additive.
- Older consumers that don't recognize ErrCloudDescribeUnavailable see a
  wrapped error and fall through to whatever their existing fallback path is,
  so there is no regression vs. today.
- The 5s timeout is a hard bound on the happy path. Real-AKS users with a
  reachable control plane should see no behavioural change; if anyone
  regularly relies on cloud-describe taking >5s, they'd notice, but that
  would already be a symptom of an unhealthy AKS connection.

Follow-ups (separate PRs)

- kubescape: bump k8s-interface, update
  core/pkg/resourcehandler/k8sresources.go to recognize the sentinel and
  log + continue. (Branch ready.) PR for that: fix(resourcehandler): recognize ErrCloudDescribeUnavailable as non-fatal (kubescape#2003)
- kubescape/operator: bump k8s-interface go.mod (no code change needed).
- Optional: apply the same context.WithTimeout pattern to
  cloudsupport/v1/ekssupport.go and gkesupport.go for symmetry. AWS/GCP
  SDKs fail faster in practice, so this is a robustness nice-to-have rather
  than a fix for a reported bug.

Summary by CodeRabbit

New Features
- Offline mode lets cloud provider describe operations be fully skipped to avoid unnecessary calls.
- Added timeouts for Azure interactions to prevent hanging requests.
Bug Fixes
- Improved error signaling so callers can reliably detect when cloud-describe operations are unavailable or fail due to air-gapped/missing-credentials scenarios.

Signed-off-by: yugal07 <yashsadhwani544@gmail.com>

coderabbitai · 2026-04-28T18:16:13Z

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: aec3f9ef-5707-4f75-a641-7966f73e256e

📥 Commits

Reviewing files that changed from the base of the PR and between ab5c59c and 7b47e45.

📒 Files selected for processing (1)

cloudsupport/cloudproviderconfiguration_test.go

📝 Walkthrough

Add KS_OFFLINE short-circuiting and exported sentinel ErrCloudDescribeUnavailable to cloud-describe entry points; AKS-specific errors are wrapped with that sentinel. Apply operation timeouts to Azure ARM calls in AKS support functions and add unit tests for offline and air-gapped scenarios.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Cloud describe control & sentinel `cloudsupport/cloudproviderconfiguration.go` | Add exported `ErrCloudDescribeUnavailable` and `cloudDescribeDisabled()` checks; `Get*FromCloudProvider` entry points return wrapped `ErrCloudDescribeUnavailable` when offline or AKS describe failures occur. |
| Tests for offline / air-gapped behavior `cloudsupport/cloudproviderconfiguration_test.go` | Add unit tests that set/unset `KS_OFFLINE` and Azure env vars to assert callers receive `ErrCloudDescribeUnavailable` via `errors.Is`. |
| AKS Azure SDK timeouts `cloudsupport/v1/akssupport.go` | Introduce `aksCallTimeout` and `aksRBACEnumerationTimeout` constants and use `context.WithTimeout` for Azure ARM interactions in `GetClusterDescribe`, `ListAllRolesForScope`, and `ListAllRoleDefinitions` to avoid indefinite blocking. |

Sequence Diagram(s)

sequenceDiagram
  participant Caller as Caller
  participant CP as CloudProvider (local code)
  participant Azure as Azure ARM SDK

  Caller->>CP: Invoke Get*FromCloudProvider(...)
  alt KS_OFFLINE=true
    CP->>Caller: return ErrCloudDescribeUnavailable
  else KS_OFFLINE=false
    CP->>CP: create ctx with timeout
    CP->>Azure: SDK call with ctx
    alt SDK success
      Azure-->>CP: result
      CP-->>Caller: return result
    else SDK error/unreachable
      Azure-->>CP: error
      CP->>CP: wrap error with ErrCloudDescribeUnavailable
      CP-->>Caller: return wrapped error
    end
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 I sniffed the cloud and found a gate,

KS_OFFLINE said "skip — wait!"
Timeouts set to keep things spry,
Errors wrapped so callers spy,
Tests hop in — the hop's complete.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 77.78% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding timeout handling to AKS cloud-describe operations to prevent indefinite blocking in offline/air-gap scenarios. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

cloudsupport/v1/akssupport.go (1)
136-160: ⚠️ Potential issue | 🟠 Major

Share one timeout across the full role-definition flow.

ListAllRoleDefinitions creates a fresh 5-second deadline and then calls ListAllRolesForScope at line 144, which creates another independent 5-second context at line 102. The list operation consumes time from its own budget, leaving only the remaining time in the outer deadline for the subsequent GetByID calls in the loop at line 154. If the listing phase consumes most of the outer 5-second window, the definition-fetch phase may timeout prematurely. Pass the context through ListAllRolesForScope instead of recreating it, so the entire operation shares a single deadline.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cloudsupport/v1/akssupport.go` around lines 136 - 160, ListAllRoleDefinitions
currently creates a context (ctx) with aksCallTimeout and then calls
ListAllRolesForScope which itself creates a new independent timeout; change
ListAllRolesForScope to accept a context parameter (e.g., ctx context.Context)
and have ListAllRoleDefinitions pass its ctx into ListAllRolesForScope so the
listing and subsequent GetByID calls share the same deadline; update the
ListAllRolesForScope implementation to stop creating its own context and use the
passed ctx for all Azure calls (and adjust any callers accordingly) so that
GetByID iterations use the same timeout window.
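
To make the reviewer's suggestion concrete, here is a minimal sketch of one deadline shared across a list call and the per-item fetches that follow. `listRoles`, `getDefinition` and `listAllRoleDefinitions` are hypothetical stand-ins that only check the context. This illustrates the suggested shape, not the change that was eventually merged.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// listRoles is a hypothetical stand-in for the role-assignment listing.
// It uses the caller's ctx instead of creating its own timeout.
func listRoles(ctx context.Context) ([]string, error) {
	if err := ctx.Err(); err != nil {
		return nil, err
	}
	return []string{"role-a", "role-b"}, nil
}

// getDefinition is a hypothetical stand-in for a per-role GetByID call.
func getDefinition(ctx context.Context, id string) (string, error) {
	if err := ctx.Err(); err != nil {
		return "", err
	}
	return "definition of " + id, nil
}

// listAllRoleDefinitions creates a single deadline and shares it across
// the listing and every per-role fetch.
func listAllRoleDefinitions(timeout time.Duration) ([]string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	ids, err := listRoles(ctx)
	if err != nil {
		return nil, err
	}
	defs := make([]string, 0, len(ids))
	for _, id := range ids {
		def, err := getDefinition(ctx, id)
		if err != nil {
			return nil, err
		}
		defs = append(defs, def)
	}
	return defs, nil
}

func main() {
	defs, err := listAllRoleDefinitions(5 * time.Second)
	fmt.Println(defs, err)
}
```

With a shared context the whole flow has one budget, so the total wait is bounded by a single timeout rather than by the sum of independent ones.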

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cloudsupport/cloudproviderconfiguration_test.go`:
- Around line 48-50: This test mutates process-global env vars
(KS_OFFLINE_ENV_VAR, cloudsupportv1.AZURE_SUBSCRIPTION_ID_ENV_VAR,
cloudsupportv1.AZURE_RESOURCE_GROUP_ENV_VAR) with os.Unsetenv; capture each
original value via os.LookupEnv before unsetting and register a t.Cleanup that
restores the original state (call os.Setenv if the var existed, or os.Unsetenv
if it did not). Replace the raw os.Unsetenv calls with this pattern (or use
t.Setenv to temporarily set values) so the cleanup restores environment for
subsequent tests.


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c724d0a9-3c28-4d00-9569-16812acc6cdc

📥 Commits

Reviewing files that changed from the base of the PR and between 262e4be and dd8caf4.

📒 Files selected for processing (3)

cloudsupport/cloudproviderconfiguration.go
cloudsupport/cloudproviderconfiguration_test.go
cloudsupport/v1/akssupport.go

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cloudsupport/cloudproviderconfiguration_test.go`:
- Around line 51-56: The t.Cleanup closure currently ignores errors from
os.Setenv and os.Unsetenv; update the closure in the test so both calls capture
their returned error and handle it like the earlier env setup code (e.g., if err
!= nil { t.Fatalf("...: %v", err) }). Locate the t.Cleanup anonymous func that
references had, key, and prev and wrap os.Setenv(key, prev) and os.Unsetenv(key)
with error checks that call t.Fatalf (or t.Fatalf-style reporting) including the
err values to satisfy errcheck and match the pattern used elsewhere in the test.
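
The two test-hygiene comments (restore env vars after the test, and check the errors from `os.Setenv`/`os.Unsetenv`) could be combined in a helper along these lines. The helper name `unsetEnvWithCleanup` appears in a commit title on this PR, but the body below is an assumption, not the merged code. The `tb` interface and `fakeT` exist only so the sketch runs outside `go test`; in a real test you would pass `*testing.T`, which satisfies `tb`.

```go
package main

import (
	"fmt"
	"os"
)

// tb is the subset of *testing.T the helper needs.
type tb interface {
	Helper()
	Cleanup(func())
	Fatalf(format string, args ...any)
}

// unsetEnvWithCleanup unsets key for the duration of a test and restores
// the previous state afterwards, checking every env mutation error.
func unsetEnvWithCleanup(t tb, key string) {
	t.Helper()
	prev, had := os.LookupEnv(key)
	if err := os.Unsetenv(key); err != nil {
		t.Fatalf("unset %s: %v", key, err)
	}
	t.Cleanup(func() {
		var err error
		if had {
			err = os.Setenv(key, prev)
		} else {
			err = os.Unsetenv(key)
		}
		if err != nil {
			t.Fatalf("restore %s: %v", key, err)
		}
	})
}

// fakeT is a tiny stand-in for *testing.T (demo only).
type fakeT struct{ cleanups []func() }

func (f *fakeT) Helper()           {}
func (f *fakeT) Cleanup(fn func()) { f.cleanups = append(f.cleanups, fn) }
func (f *fakeT) Fatalf(format string, args ...any) {
	panic(fmt.Sprintf(format, args...))
}

// finish runs registered cleanups in reverse order, as the testing package does.
func (f *fakeT) finish() {
	for i := len(f.cleanups) - 1; i >= 0; i-- {
		f.cleanups[i]()
	}
}

// demo sets key, runs the helper, and reports whether key was set while
// the "test" ran and what value it holds after cleanup.
func demo(key, initial string) (setDuring bool, after string) {
	if err := os.Setenv(key, initial); err != nil {
		panic(err)
	}
	t := &fakeT{}
	unsetEnvWithCleanup(t, key)
	_, setDuring = os.LookupEnv(key)
	t.finish()
	return setDuring, os.Getenv(key)
}

func main() {
	during, after := demo("DEMO_SUBSCRIPTION_ID", "before")
	fmt.Println("set during test:", during, "value after cleanup:", after)
}
```

When a test only needs to set a variable rather than unset it, `t.Setenv` from the standard `testing` package handles the restore automatically.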

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b662165e-bd7e-41c0-a911-67b6a56b42d5

📥 Commits

Reviewing files that changed from the base of the PR and between dd8caf4 and ab5c59c.

📒 Files selected for processing (2)

cloudsupport/cloudproviderconfiguration_test.go
cloudsupport/v1/akssupport.go

Signed-off-by: yugal07 <yashsadhwani544@gmail.com>

matthyx

thanks @yugal07 !

yugal07 · 2026-04-29T07:00:48Z

Thanks @matthyx. This is the kubescape-side PR to handle this:
kubescape/kubescape#2003
Thanks.

matthyx · 2026-04-29T07:05:21Z

yes, please bump the version there

Adding Cloud unavailable message

dd8caf4

Signed-off-by: yugal07 <yashsadhwani544@gmail.com>

yugal07 mentioned this pull request Apr 28, 2026

fix(resourcehandler): recognize ErrCloudDescribeUnavailable as non-fatal kubescape/kubescape#2003

Merged

9 tasks

coderabbitai Bot reviewed Apr 28, 2026

Comment thread cloudsupport/cloudproviderconfiguration_test.go Outdated

test(cloudsupport): restore env vars via t.Cleanup in AKS missing-cre…

a9295f7

…ds test - Coderabbit Signed-off-by: yugal07 <yashsadhwani544@gmail.com>

matthyx reviewed Apr 29, 2026

Comment thread cloudsupport/v1/akssupport.go Outdated

refactor(cloudsupport): split AKS timeout for RBAC enumeration paths

ab5c59c

Signed-off-by: yugal07 <yashsadhwani544@gmail.com>

coderabbitai Bot reviewed Apr 29, 2026

Comment thread cloudsupport/cloudproviderconfiguration_test.go

test(cloudsupport): check env mutation errors in unsetEnvWithCleanup

7b47e45

Signed-off-by: yugal07 <yashsadhwani544@gmail.com>

matthyx approved these changes Apr 29, 2026

matthyx merged commit 35d1027 into kubescape:main Apr 29, 2026
9 of 10 checks passed

matthyx merged 4 commits into kubescape:main from yugal07:fix/cloud-aks-descript
