fix(cloudsupport): make AKS cloud-describe with timeout#149
fix(cloudsupport): make AKS cloud-describe with timeout#149matthyx merged 4 commits intokubescape:mainfrom
Conversation
Signed-off-by: yugal07 <yashsadhwani544@gmail.com>
|
Warning Rate limit exceeded
To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughAdd KS_OFFLINE short-circuiting and exported sentinel Changes
Sequence Diagram(s)sequenceDiagram
participant Caller as Caller
participant CP as CloudProvider (local code)
participant Azure as Azure ARM SDK
Caller->>CP: Invoke Get*FromCloudProvider(...)
alt KS_OFFLINE=true
CP->>Caller: return ErrCloudDescribeUnavailable
else KS_OFFLINE=false
CP->>CP: create ctx with timeout
CP->>Azure: SDK call with ctx
alt SDK success
Azure-->>CP: result
CP-->>Caller: return result
else SDK error/unreachable
Azure-->>CP: error
CP->>CP: wrap error with ErrCloudDescribeUnavailable
CP-->>Caller: return wrapped error
end
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Review rate limit: 0/1 reviews remaining, refill in 24 minutes and 46 seconds.Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cloudsupport/v1/akssupport.go (1)
136-160:⚠️ Potential issue | 🟠 MajorShare one timeout across the full role-definition flow.
ListAllRoleDefinitionscreates a fresh 5-second deadline and then callsListAllRolesForScopeat line 144, which creates another independent 5-second context at line 102. The list operation consumes time from its own budget, leaving only the remaining time in the outer deadline for the subsequentGetByIDcalls in the loop at line 154. If the listing phase consumes most of the outer 5-second window, the definition-fetch phase may timeout prematurely. Pass the context throughListAllRolesForScopeinstead of recreating it, so the entire operation shares a single deadline.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cloudsupport/v1/akssupport.go` around lines 136 - 160, ListAllRoleDefinitions currently creates a context (ctx) with aksCallTimeout and then calls ListAllRolesForScope which itself creates a new independent timeout; change ListAllRolesForScope to accept a context parameter (e.g., ctx context.Context) and have ListAllRoleDefinitions pass its ctx into ListAllRolesForScope so the listing and subsequent GetByID calls share the same deadline; update the ListAllRolesForScope implementation to stop creating its own context and use the passed ctx for all Azure calls (and adjust any callers accordingly) so that GetByID iterations use the same timeout window.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cloudsupport/cloudproviderconfiguration_test.go`:
- Around line 48-50: This test mutates process-global env vars
(KS_OFFLINE_ENV_VAR, cloudsupportv1.AZURE_SUBSCRIPTION_ID_ENV_VAR,
cloudsupportv1.AZURE_RESOURCE_GROUP_ENV_VAR) with os.Unsetenv; capture each
original value via os.LookupEnv before unsetting and register a t.Cleanup that
restores the original state (call os.Setenv if the var existed, or os.Unsetenv
if it did not). Replace the raw os.Unsetenv calls with this pattern (or use
t.Setenv to temporarily set values) so the cleanup restores environment for
subsequent tests.
---
Outside diff comments:
In `@cloudsupport/v1/akssupport.go`:
- Around line 136-160: ListAllRoleDefinitions currently creates a context (ctx)
with aksCallTimeout and then calls ListAllRolesForScope which itself creates a
new independent timeout; change ListAllRolesForScope to accept a context
parameter (e.g., ctx context.Context) and have ListAllRoleDefinitions pass its
ctx into ListAllRolesForScope so the listing and subsequent GetByID calls share
the same deadline; update the ListAllRolesForScope implementation to stop
creating its own context and use the passed ctx for all Azure calls (and adjust
any callers accordingly) so that GetByID iterations use the same timeout window.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: c724d0a9-3c28-4d00-9569-16812acc6cdc
📒 Files selected for processing (3)
cloudsupport/cloudproviderconfiguration.gocloudsupport/cloudproviderconfiguration_test.gocloudsupport/v1/akssupport.go
…ds test - Coderabbit Signed-off-by: yugal07 <yashsadhwani544@gmail.com>
Signed-off-by: yugal07 <yashsadhwani544@gmail.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cloudsupport/cloudproviderconfiguration_test.go`:
- Around line 51-56: The t.Cleanup closure currently ignores errors from
os.Setenv and os.Unsetenv; update the closure in the test so both calls capture
their returned error and handle it like the earlier env setup code (e.g., if err
!= nil { t.Fatalf("...: %v", err) }). Locate the t.Cleanup anonymous func that
references had, key, and prev and wrap os.Setenv(key, prev) and os.Unsetenv(key)
with error checks that call t.Fatalf (or t.Fatalf-style reporting) including the
err values to satisfy errcheck and match the pattern used elsewhere in the test.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: b662165e-bd7e-41c0-a911-67b6a56b42d5
📒 Files selected for processing (2)
cloudsupport/cloudproviderconfiguration_test.gocloudsupport/v1/akssupport.go
Signed-off-by: yugal07 <yashsadhwani544@gmail.com>
|
Thanks @matthyx, This is the kubescape side PR to handle this |
|
yes, please bump the version there |
Summary
Fixes the AKS air-gap path where
cloudsupport.GetDescriptiveInfoFromCloudProviderhangs indefinitely on
azidentity.NewDefaultAzureCredentialand unbounded ARMcalls, starving the kubescape scan loop. Resolves the root cause behind
kubescape/helm-charts#637.
The cloud-describe step is now best-effort:
ErrCloudDescribeUnavailableis returned when describe failsfor non-fatal reasons (offline mode, missing creds, network unreachable).
KS_OFFLINE=trueshort-circuit skips cloud-describe entirely. This env varis already set by the Helm chart when
capabilities.kubescapeOffline=enable.Callers in
kubescaperecognize the sentinel viaerrors.Isand continue thescan instead of aborting, so host-scanner / node-agent data still gets
collected and persisted.
Why this matters
In an AKS air-gapped cluster:
cloudsupport.GetCloudProviderdetects AKS via the node'sazure://providerID.azidentity.NewDefaultAzureCredential(nil)whichprobes IMDS at
169.254.169.254and the AAD token endpoint.chain hits its own connect timeout sequentially, then the ARM
Getcallhangs against a non-routable ARM endpoint.
prevents the host-scanner reporting pipeline from completing in time,
the symptom users see is missing
cniinfo-*,kubeletinfo-*,kubeproxyinfo-*, andcontrolplaneinfo-*workloadconfigurationscan CRDs.Changes
cloudsupport/v1/akssupport.goaksCallTimeout = 5 * time.Secondconstant.GetClusterDescribe,ListAllRolesForScope,ListAllRoleDefinitionsnowderive a deadline-bound
ctxviacontext.WithTimeoutand pass it intothe Azure SDK call. Failures are bounded; no behavioural change when the
control plane is reachable.
cloudsupport/cloudproviderconfiguration.goErrCloudDescribeUnavailablesentinel error.KS_OFFLINE_ENV_VARconstant +cloudDescribeDisabled()helper.GetDescriptiveInfoFromCloudProvider,GetDescribeRepositoriesFromCloudProvider,GetListEntitiesForPoliciesFromCloudProvider,GetPolicyVersionFromCloudProvider) short-circuit at the top with thesentinel when
KS_OFFLINE=true.are wrapped with the sentinel so callers can
errors.Isand treat themas non-fatal.
cloudsupport/cloudproviderconfiguration_test.go(new)TestKSOfflineShortCircuitsCloudDescribe— asserts every entry point honorsKS_OFFLINE=trueand returnsErrCloudDescribeUnavailable.TestAKSMissingCredsWrapsSentinel— asserts the AKS error path returnsthe sentinel even without
KS_OFFLINE, which is the realistic state forair-gapped users with no Azure creds configured.
Test plan
Unit
go test ./cloudsupport/...— all green, incl. existing AKS describe tests.go vet ./cloudsupport/...— clean.End-to-end (kind, simulated AKS air-gap)
azure://...providerID baked into kubeletconfig per node, so
cloudsupport.GetCloudProviderreturns AKS.OUTPUT -d 169.254.169.254 -j DROPon every node to simulateIMDS unreachability the AKS-style way (drop, not reject).
capabilities.kubescapeOffline=enable,using a custom image built from this PR + a corresponding kubescape
consumer-side patch.
Downloading cloud resources...→Downloaded cloud resourcesfinishedwithin the same second; no
DefaultAzureCredentialtraces in the log.cloud-describe-unavailablenext to AKS-specificcontrols (CIS-AKS, "Manage Kubernetes RBAC users with Azure AD", etc.) —
the sentinel propagates into user-facing output.
cis-aks-t1.2.0, cis-aks-t1.8.0, cis-v1.10.0, cis-v1.12.0, DevOpsBest,
MITRE, NSA, SOC2) — proves AKS detection still happens and the rest of
the scan pipeline is unaffected.
Risk / compatibility
IAKSSupportinterface signatures preserved,so existing mocks (
AKSSupportMock) and consumers compile without changes.ErrCloudDescribeUnavailable,KS_OFFLINE_ENV_VAR.Both additive.
ErrCloudDescribeUnavailablesee awrapped error and fall through to whatever their existing fallback path is —
no regression vs. today.
reachable control plane should see no behavioural change; if anyone
regularly relies on cloud-describe taking >5s, they'd notice — but that
would already be a symptom of an unhealthy AKS connection.
Follow-ups (separate PRs)
kubescape: bumpk8s-interface, updatecore/pkg/resourcehandler/k8sresources.goto recognize the sentinel andlog + continue. (Branch ready.) PR For that - fix(resourcehandler): recognize ErrCloudDescribeUnavailable as non-fatal kubescape#2003
kubescape/operator: bumpk8s-interfacego.mod (no code change needed).context.WithTimeoutpattern tocloudsupport/v1/ekssupport.goandgkesupport.gofor symmetry. AWS/GCPSDKs fail faster in practice, so this is a robustness nice-to-have rather
than a fix for a reported bug.
Summary by CodeRabbit
New Features
Bug Fixes