CNTRLPLANE-1772: feat(audit): add persistent audit log feature with automatic snapshots #7241
base: main
Conversation
Introduces a new audit log persistence feature that enables hosted clusters to store kube-apiserver audit logs on persistent volumes, with automatic snapshot capabilities for crash recovery and forensic analysis.

Key components:
- New AuditLogPersistenceConfig CRD for management cluster administrators
- Mutating webhooks to inject PVCs and volume mounts into kube-apiserver pods
- Snapshot controller that automatically creates VolumeSnapshots on pod crashes
- ConfigMap webhook to configure audit log rotation settings
- Configurable retention policies for snapshot management

The feature is opt-in via the --enable-audit-log-persistence flag during HyperShift operator installation, ensuring backward compatibility and allowing gradual rollout across management clusters.

Architecture:
- API definitions in api/auditlogpersistence/v1alpha1/
- Controller and webhook logic in hypershift-operator/controllers/auditlogpersistence/
- Installation and RBAC changes in cmd/install/
- Integration with hypershift-operator main.go for webhook registration

Signed-off-by: Cesar Wong <[email protected]>
Commit-Message-Assisted-by: Claude (via Claude Code)
@csrwng: This pull request references CNTRLPLANE-1772 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Skipping CI for Draft Pull Request.
Walkthrough
The PR introduces a new audit log persistence feature for HyperShift, adding a custom resource definition (CRD) to configure persistent audit logging, mutating webhooks to inject storage and log configuration into kube-apiserver pods and ConfigMaps, and a snapshot controller to manage automatic volume snapshots with retention policies.
Changes
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
Warning
There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.
🔧 golangci-lint (2.5.0)
Error: can't load config: unsupported version of the configuration: ""
See https://golangci-lint.run/docs/product/migration-guide for migration instructions
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: csrwng
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@coderabbitai review
✅ Actions performed
Review triggered.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cmd/install/install.go (1)
645-683: Audit log persistence CRD is not actually gated by the feature flag

In setupCRDs, the intention is to include audit log persistence CRDs only when opts.EnableAuditLogPersistence is true:

```go
// Conditionally include auditlogpersistence CRD only if feature is enabled
if strings.Contains(path, "auditlogpersistence") {
	return opts.EnableAuditLogPersistence
}
```

However, this check currently appears after:

```go
if strings.Contains(path, "hypershift-operator/") {
	return true
}
```

For the audit log persistence CRD, the path will contain "hypershift-operator/", so the function returns true before it ever reaches the auditlogpersistence condition. As a result, the CRD is always installed regardless of the flag, which contradicts the opt-in design.

Reordering the checks fixes this:

```diff
-	if strings.Contains(path, "hypershift-operator/") {
-		return true
-	}
-	if strings.Contains(path, "cluster-api/") {
-		return true
-	}
-	// Conditionally include auditlogpersistence CRD only if feature is enabled
-	if strings.Contains(path, "auditlogpersistence") {
-		return opts.EnableAuditLogPersistence
-	}
+	// Conditionally include auditlogpersistence CRDs only if the feature is enabled
+	if strings.Contains(path, "auditlogpersistence") {
+		return opts.EnableAuditLogPersistence
+	}
+	if strings.Contains(path, "hypershift-operator/") {
+		return true
+	}
+	if strings.Contains(path, "cluster-api/") {
+		return true
+	}
```

This ensures that the new CRD truly respects the feature flag.
🧹 Nitpick comments (4)
cmd/install/install.go (1)
558-563: Webhook, deployment, and RBAC plumbing for audit log persistence is coherent

- The mutating webhook configuration is now created when either defaulting webhooks or audit log persistence are enabled, and the template receives EnableAuditLogPersistence to control which webhooks are actually configured.
- HyperShiftOperatorDeployment.EnableWebhook now accounts for the new feature, ensuring the operator exposes a webhook server when only audit log persistence is enabled.
- HyperShiftOperatorDeployment.EnableAuditLogPersistence and HyperShiftOperatorClusterRole.EnableAuditLogPersistence give the assets layer enough context to inject the correct env vars and RBAC for the snapshot controller and webhooks.

One follow-up you may want to consider: updating ApplyDefaults to set HyperShiftOperatorReplicas = 2 when only EnableAuditLogPersistence is true, to keep webhook availability semantics consistent with other webhook-driven features.

Example refinement:

```diff
 func (o *Options) ApplyDefaults() {
 	o.RenderNamespace = true
 	switch {
 	case o.Development:
 		o.HyperShiftOperatorReplicas = 0
-	case o.EnableDefaultingWebhook || o.EnableConversionWebhook || o.EnableValidatingWebhook:
+	case o.EnableDefaultingWebhook || o.EnableConversionWebhook || o.EnableValidatingWebhook || o.EnableAuditLogPersistence:
 		o.HyperShiftOperatorReplicas = 2
 	default:
 		o.HyperShiftOperatorReplicas = 1
 	}
 }
```

Also applies to: 789-799, 826-827, 967-977
hypershift-operator/controllers/auditlogpersistence/configmap_webhook_test.go (1)
80-104: Tighten assertion for early-return path (no config.json key)

In the "ConfigMap without config.json key returns early" case you only assert cm.Data[kubeAPIServerConfigKey] is empty. Because a missing key also reads as "", this would still pass if mutateConfigMap started writing an empty config.json entry.

To make the test guard the early-return behavior more strictly, consider also (or instead) asserting that the key does not exist:

```go
configJSON, exists := cm.Data[kubeAPIServerConfigKey]
g.Expect(exists).To(BeFalse())
g.Expect(configJSON).To(Equal(""))
```

This keeps the intent clear: the webhook should not touch the key at all when it's absent.
cmd/install/assets/hypershift_operator.go (1)
1007-1042: Audit-log RBAC block partially redundant with existing ClusterRole rules

Within HyperShiftOperatorClusterRole.Build, the conditional RBAC block for audit log persistence adds rules for PVCs, pods, and configmaps, but the earlier core-group rule already grants rbacv1.VerbAll on those resources:

```go
{
	APIGroups: []string{""},
	Resources: []string{
		"events",
		"configmaps",
		"configmaps/finalizers",
		"persistentvolumeclaims",
		"pods",
		"pods/log",
		...
	},
	Verbs: []string{rbacv1.VerbAll},
},
```

So the new core rules under if o.EnableAuditLogPersistence { ... } don't change effective privileges and mainly add noise. The only materially new capability is list/watch for snapshot.storage.k8s.io/volumesnapshots, which is meaningful on top of the existing get/create/delete rule.

Consider trimming the conditional block to just the volumesnapshot rule (or, if you intend to truly gate core permissions by this feature, moving the PVC/pod/configmap access out of the broad VerbAll rule into feature-specific blocks—but that's a larger behavior change).
Also applies to: 1310-1338
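For illustration, a minimal sketch of what the trimmed conditional block could reduce to, assuming only the VolumeSnapshot access is genuinely new; the helper name and exact verb set are illustrative, not the PR's actual code:

```go
import rbacv1 "k8s.io/api/rbac/v1"

// auditLogPersistenceRules returns only the RBAC that is not already covered
// by the broad core-group VerbAll rule: access to VolumeSnapshots in the
// snapshot.storage.k8s.io group. (Sketch only; verbs are assumed.)
func auditLogPersistenceRules(enabled bool) []rbacv1.PolicyRule {
	if !enabled {
		return nil
	}
	return []rbacv1.PolicyRule{
		{
			APIGroups: []string{"snapshot.storage.k8s.io"},
			Resources: []string{"volumesnapshots"},
			Verbs:     []string{"get", "list", "watch", "create", "delete"},
		},
	}
}
```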
hypershift-operator/controllers/auditlogpersistence/configmap_webhook.go (1)
43-110: Refactor is optional but recommended for robustness; current code is functionally correct

The web search confirms that AdmissionRequest.name is populated when the client provides explicit metadata.name, but will be empty if the client omits metadata.name (e.g., uses generateName). However, analysis of the codebase shows that kas-config is a system-managed ConfigMap always created with explicit metadata.name in all fixtures and tests—never via generateName. This means the current check on req.Name is functionally correct for this use case.

That said, the suggested refactor to check configMap.Name after decoding remains a worthwhile defensive improvement. It makes the handler agnostic to how AdmissionRequest populates the name field and is more maintainable should the creation pattern change.
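A rough sketch of the suggested defensive pattern, assuming a standard controller-runtime admission handler; the struct name, decoder field, and the way the handler bails out are placeholders rather than the PR's exact code:

```go
package auditlogpersistence

import (
	"context"
	"net/http"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/webhook/admission"
)

type configMapMutator struct {
	decoder admission.Decoder
}

func (m *configMapMutator) Handle(ctx context.Context, req admission.Request) admission.Response {
	cm := &corev1.ConfigMap{}
	if err := m.decoder.Decode(req, cm); err != nil {
		return admission.Errored(http.StatusBadRequest, err)
	}
	// Check the decoded object's name rather than req.Name, so the handler
	// does not depend on how the AdmissionRequest populates that field.
	if cm.Name != "kas-config" {
		return admission.Allowed("not the kube-apiserver config ConfigMap")
	}
	// ... mutate cm and return a JSON patch response here (omitted) ...
	return admission.Allowed("no changes")
}
```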
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
⛔ Files ignored due to path filters (6)
- vendor/github.com/openshift/hypershift/api/auditlogpersistence/register.go is excluded by !vendor/**, !**/vendor/**
- vendor/github.com/openshift/hypershift/api/auditlogpersistence/v1alpha1/auditlogpersistenceconfig_types.go is excluded by !vendor/**, !**/vendor/**
- vendor/github.com/openshift/hypershift/api/auditlogpersistence/v1alpha1/doc.go is excluded by !vendor/**, !**/vendor/**
- vendor/github.com/openshift/hypershift/api/auditlogpersistence/v1alpha1/register.go is excluded by !vendor/**, !**/vendor/**
- vendor/github.com/openshift/hypershift/api/auditlogpersistence/v1alpha1/zz_generated.deepcopy.go is excluded by !vendor/**, !**/vendor/**
- vendor/modules.txt is excluded by !vendor/**, !**/vendor/**
📒 Files selected for processing (17)
- Makefile (1 hunks)
- api/auditlogpersistence/register.go (1 hunks)
- api/auditlogpersistence/v1alpha1/auditlogpersistenceconfig_types.go (1 hunks)
- api/auditlogpersistence/v1alpha1/doc.go (1 hunks)
- api/auditlogpersistence/v1alpha1/register.go (1 hunks)
- api/auditlogpersistence/v1alpha1/zz_generated.deepcopy.go (1 hunks)
- cmd/install/assets/hypershift-operator/auditlogpersistence.hypershift.openshift.io_auditlogpersistenceconfigs.yaml (1 hunks)
- cmd/install/assets/hypershift_operator.go (7 hunks)
- cmd/install/install.go (7 hunks)
- hypershift-operator/controllers/auditlogpersistence/configmap_webhook.go (1 hunks)
- hypershift-operator/controllers/auditlogpersistence/configmap_webhook_test.go (1 hunks)
- hypershift-operator/controllers/auditlogpersistence/pod_webhook.go (1 hunks)
- hypershift-operator/controllers/auditlogpersistence/pod_webhook_test.go (1 hunks)
- hypershift-operator/controllers/auditlogpersistence/snapshot_controller.go (1 hunks)
- hypershift-operator/controllers/auditlogpersistence/snapshot_controller_test.go (1 hunks)
- hypershift-operator/main.go (3 hunks)
- support/api/scheme.go (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**
⚙️ CodeRabbit configuration file
- Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.
Files:
- Makefile
- api/auditlogpersistence/register.go
- cmd/install/assets/hypershift-operator/auditlogpersistence.hypershift.openshift.io_auditlogpersistenceconfigs.yaml
- api/auditlogpersistence/v1alpha1/register.go
- hypershift-operator/controllers/auditlogpersistence/snapshot_controller_test.go
- support/api/scheme.go
- hypershift-operator/controllers/auditlogpersistence/configmap_webhook_test.go
- hypershift-operator/controllers/auditlogpersistence/pod_webhook.go
- hypershift-operator/controllers/auditlogpersistence/configmap_webhook.go
- cmd/install/install.go
- hypershift-operator/main.go
- api/auditlogpersistence/v1alpha1/zz_generated.deepcopy.go
- api/auditlogpersistence/v1alpha1/doc.go
- api/auditlogpersistence/v1alpha1/auditlogpersistenceconfig_types.go
- cmd/install/assets/hypershift_operator.go
- hypershift-operator/controllers/auditlogpersistence/snapshot_controller.go
- hypershift-operator/controllers/auditlogpersistence/pod_webhook_test.go
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Red Hat Konflux / control-plane-operator-main-on-pull-request
- GitHub Check: Red Hat Konflux / hypershift-operator-main-on-pull-request
- GitHub Check: Red Hat Konflux / hypershift-release-mce-211-on-pull-request
- GitHub Check: Red Hat Konflux / hypershift-cli-mce-211-on-pull-request
🔇 Additional comments (12)
api/auditlogpersistence/register.go (1)
1-3: Group name definition looks correct and consistent
GroupName matches the v1alpha1 package annotations and registration usage; this provides a clean anchor for the new API group.
Makefile (1)
191-195: CRD generation wiring for auditlogpersistence is consistent
Adding ./api/auditlogpersistence/... to the CRD generation step mirrors existing patterns (scheduling, certificates) and ensures assets stay in sync with the API types.
support/api/scheme.go (1)
6-7: Scheme registration for audit log persistence API is correctly wired
Importing auditlogpersistencev1alpha1 and calling AddToScheme alongside other HyperShift APIs is consistent with the existing scheme setup and enables the new CRD types to be used by clients and controllers.
Also applies to: 88-107
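For reference, a minimal sketch of this wiring pattern; the scheme variable and init arrangement are illustrative, and the real support/api/scheme.go registers many more API groups:

```go
package api

import (
	"k8s.io/apimachinery/pkg/runtime"
	utilruntime "k8s.io/apimachinery/pkg/util/runtime"

	auditlogpersistencev1alpha1 "github.com/openshift/hypershift/api/auditlogpersistence/v1alpha1"
)

// Scheme aggregates the API types used by the operator's clients and controllers.
var Scheme = runtime.NewScheme()

func init() {
	// Registering the new group alongside the existing HyperShift APIs lets
	// typed clients work with AuditLogPersistenceConfig objects.
	utilruntime.Must(auditlogpersistencev1alpha1.AddToScheme(Scheme))
}
```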
hypershift-operator/main.go (1)
28-29: Audit log persistence webhooks and controller are wired in coherently
The new imports and the ENABLE_AUDIT_LOG_PERSISTENCE-gated block correctly:
- Register pod and ConfigMap mutating webhooks only when TLS certs are configured (non-empty CertDir), and
- Always wire the snapshot controller when the feature is enabled.
This keeps the feature optional and avoids impacting clusters where the env var is unset, while reusing the manager’s scheme, logger, and client in a standard way.
Also applies to: 75-76, 547-580
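A sketch of the gating shape described above, assuming standard controller-runtime manager wiring; the Setup* function names are placeholders for the PR's actual registration helpers, and os, mgr, certDir, and log come from the surrounding main() setup:

```go
if os.Getenv("ENABLE_AUDIT_LOG_PERSISTENCE") == "true" {
	// Webhooks need serving certificates; only register them when a cert dir is set.
	if certDir != "" {
		if err := auditlogpersistence.SetupPodWebhookWithManager(mgr); err != nil {
			log.Error(err, "unable to register audit log persistence pod webhook")
			os.Exit(1)
		}
		if err := auditlogpersistence.SetupConfigMapWebhookWithManager(mgr); err != nil {
			log.Error(err, "unable to register audit log persistence configmap webhook")
			os.Exit(1)
		}
	}
	// The snapshot controller is always wired when the feature is enabled.
	if err := auditlogpersistence.SetupSnapshotControllerWithManager(mgr); err != nil {
		log.Error(err, "unable to register audit log snapshot controller")
		os.Exit(1)
	}
}
```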
api/auditlogpersistence/v1alpha1/doc.go (1)
1-4: Codegen annotations for v1alpha1 are appropriate
The deepcopy, groupName, and openapi annotations are standard for CRD API packages and align with the defined group.
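For context, a doc.go carrying these markers conventionally looks roughly like this; the exact marker arguments are in the PR, and this only sketches the usual shape:

```go
// Package v1alpha1 contains API types for audit log persistence configuration.
// +k8s:deepcopy-gen=package,register
// +groupName=auditlogpersistence.hypershift.openshift.io
// +k8s:openapi-gen=true
package v1alpha1
```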
api/auditlogpersistence/v1alpha1/register.go (1)
1-37: v1alpha1 scheme registration follows Kubernetes conventions
SchemeGroupVersion, the Kind/Resource helpers, and addKnownTypes correctly register AuditLogPersistenceConfig and its list type into the scheme with the expected group/version.
cmd/install/install.go (1)
77-83: New install option and flag are threaded cleanly
Introducing EnableAuditLogPersistence on Options and exposing it via --enable-audit-log-persistence integrates well with the existing install flow, keeping the feature opt-in from the CLI down into manifest generation.
Also applies to: 132-133, 281-282
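A minimal sketch of how such an opt-in flag typically threads into the install options, assuming spf13/pflag as used by the CLI; the binding function and help text are illustrative:

```go
package install

import "github.com/spf13/pflag"

type Options struct {
	// EnableAuditLogPersistence gates the audit log persistence CRD, webhooks,
	// and snapshot controller RBAC during installation.
	EnableAuditLogPersistence bool
	// ... other install options ...
}

func bindAuditLogPersistenceFlag(opts *Options, flags *pflag.FlagSet) {
	// Off by default, so existing management clusters are unaffected unless an
	// administrator explicitly opts in at install time.
	flags.BoolVar(&opts.EnableAuditLogPersistence, "enable-audit-log-persistence", false,
		"Enable persistent audit log storage and automatic snapshots for hosted control planes")
}
```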
cmd/install/assets/hypershift-operator/auditlogpersistence.hypershift.openshift.io_auditlogpersistenceconfigs.yaml (1)
1-203: CRD schema, defaults, and validation look consistent
The CRD shape (cluster-scoped, single cluster instance, sane defaults for auditLog/snapshots/storage, and regexes for duration/quantity) aligns with the described feature and shouldn't pose compatibility issues.
hypershift-operator/controllers/auditlogpersistence/snapshot_controller_test.go (1)
54-1276: Comprehensive snapshot controller test coverage looks solid
The table-driven tests around SnapshotReconciler (reconcile paths, retention management, timestamp sorting, integer parsing, pod classification, and snapshot creation) give broad coverage of success and edge conditions, and the helper functions make side-effect checks clear and reusable. No major issues stand out here.
hypershift-operator/controllers/auditlogpersistence/pod_webhook_test.go (1)
21-561: Comprehensive mutatePod coverage looks solid
The table-driven tests exercise all key mutatePod branches (name vs generateName, missing names, ReplicaSet ownership, existing PVC, storage class handling, and volume wiring), using a fake client + scheme correctly. This gives good confidence in the PVC and volume mutation behavior.
hypershift-operator/controllers/auditlogpersistence/pod_webhook.go (1)
49-127: Admission handler + mutation flow look correct and well-scoped
The webhook correctly narrows scope to kube-apiserver Pods in hosted control-plane namespaces, checks the cluster-scoped AuditLogPersistenceConfig and Spec.Enabled, deep-copies the Pod before mutation, and returns a patch response. Combined with mutatePod, the PVC creation and logs volume rewiring for kube-apiserver Pods are implemented cleanly without obvious correctness or security pitfalls.
api/auditlogpersistence/v1alpha1/zz_generated.deepcopy.go (1)
1-171: Generated deepcopy code is consistent with the v1alpha1 types
This file appears to be standard controller-gen output for the new audit log persistence types, with correct handling of nested structs and slices. No changes needed here.
```go
// PerPodRetentionCount is the maximum number of snapshots to retain per PVC.
// When this limit is reached, the oldest snapshot for that PVC will be deleted.
// +kubebuilder:default=10
PerPodRetentionCount int32 `json:"perPodRetentionCount,omitempty"`

// NamespaceRetentionCount is the maximum total number of snapshots to retain per namespace.
// When this limit is reached, the oldest snapshot in the namespace will be deleted.
// +kubebuilder:default=50
NamespaceRetentionCount int32 `json:"namespaceRetentionCount,omitempty"`
```
🛠️ Refactor suggestion | 🟠 Major
Enforce non-negative snapshot retention counts at the API level
PerPodRetentionCount and NamespaceRetentionCount are currently unbounded int32 values with defaults but no minimum validation. In manageRetention (Line 215 onwards in snapshot_controller.go), negative values will lead to toDelete exceeding the slice length and cause a panic when indexing into the snapshots slice.
To prevent invalid configs from ever reaching the controller, add Minimum=0 validation on both fields:
```diff
 // PerPodRetentionCount is the maximum number of snapshots to retain per PVC.
 // When this limit is reached, the oldest snapshot for that PVC will be deleted.
 // +kubebuilder:default=10
+// +kubebuilder:validation:Minimum=0
 PerPodRetentionCount int32 `json:"perPodRetentionCount,omitempty"`

 // NamespaceRetentionCount is the maximum total number of snapshots to retain per namespace.
 // When this limit is reached, the oldest snapshot in the namespace will be deleted.
 // +kubebuilder:default=50
+// +kubebuilder:validation:Minimum=0
 NamespaceRetentionCount int32 `json:"namespaceRetentionCount,omitempty"`
```

This keeps the CRD self-defensive and aligns with the expectations in the controller logic.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```go
// PerPodRetentionCount is the maximum number of snapshots to retain per PVC.
// When this limit is reached, the oldest snapshot for that PVC will be deleted.
// +kubebuilder:default=10
// +kubebuilder:validation:Minimum=0
PerPodRetentionCount int32 `json:"perPodRetentionCount,omitempty"`

// NamespaceRetentionCount is the maximum total number of snapshots to retain per namespace.
// When this limit is reached, the oldest snapshot in the namespace will be deleted.
// +kubebuilder:default=50
// +kubebuilder:validation:Minimum=0
NamespaceRetentionCount int32 `json:"namespaceRetentionCount,omitempty"`
```
🤖 Prompt for AI Agents
In api/auditlogpersistence/v1alpha1/auditlogpersistenceconfig_types.go around
lines 83 to 91, the PerPodRetentionCount and NamespaceRetentionCount fields lack
minimum validation which allows negative values and can cause panics in the
controller; add kubebuilder validation tags to enforce a non-negative minimum
(Minimum=0) on both fields so the generated CRD rejects negative values and
aligns with controller expectations.
```go
func (r *SnapshotReconciler) manageRetention(ctx context.Context, pod *corev1.Pod, pvc *corev1.PersistentVolumeClaim, config *auditlogpersistencev1alpha1.AuditLogPersistenceConfig) error {
	// List all snapshots for this PVC
	snapshotList := &snapshotv1.VolumeSnapshotList{}
	if err := r.client.List(ctx, snapshotList, client.InNamespace(pod.Namespace), client.MatchingLabels{
		auditLogsPVCLabelKey: pvc.Name,
	}); err != nil {
		return fmt.Errorf("failed to list snapshots: %w", err)
	}

	// Sort snapshots by creation time (oldest first)
	snapshots := snapshotList.Items
	sortSnapshotsByCreationTime(snapshots)

	// Per-pod retention: delete oldest snapshots if over limit
	if len(snapshots) > int(config.Spec.Snapshots.PerPodRetentionCount) {
		toDelete := len(snapshots) - int(config.Spec.Snapshots.PerPodRetentionCount)
		for i := 0; i < toDelete; i++ {
			if err := r.client.Delete(ctx, &snapshots[i]); err != nil && !apierrors.IsNotFound(err) {
				return fmt.Errorf("failed to delete snapshot %s: %w", snapshots[i].Name, err)
			}
		}
	}

	// Namespace retention: list all snapshots in namespace and delete oldest if over limit
	allSnapshots := &snapshotv1.VolumeSnapshotList{}
	if err := r.client.List(ctx, allSnapshots, client.InNamespace(pod.Namespace), client.MatchingLabels{
		controlPlaneNamespaceLabelKey: pod.Namespace,
	}); err != nil {
		return fmt.Errorf("failed to list all snapshots in namespace: %w", err)
	}

	if len(allSnapshots.Items) > int(config.Spec.Snapshots.NamespaceRetentionCount) {
		sortSnapshotsByCreationTime(allSnapshots.Items)
		toDelete := len(allSnapshots.Items) - int(config.Spec.Snapshots.NamespaceRetentionCount)
		for i := 0; i < toDelete; i++ {
			if err := r.client.Delete(ctx, &allSnapshots.Items[i]); err != nil && !apierrors.IsNotFound(err) {
				return fmt.Errorf("failed to delete snapshot %s: %w", allSnapshots.Items[i].Name, err)
			}
		}
	}
```
Defensively handle negative retention counts to avoid panics
manageRetention assumes config.Spec.Snapshots.PerPodRetentionCount and NamespaceRetentionCount are non-negative. If either is negative (which the current CRD allows), toDelete becomes larger than the number of snapshots and the for loop will index past the end of the slice, causing a controller panic.
Consider clamping negative retention values to zero before using them in arithmetic:
```diff
 	// Per-pod retention: delete oldest snapshots if over limit
-	if len(snapshots) > int(config.Spec.Snapshots.PerPodRetentionCount) {
-		toDelete := len(snapshots) - int(config.Spec.Snapshots.PerPodRetentionCount)
+	perPodLimit := config.Spec.Snapshots.PerPodRetentionCount
+	if perPodLimit < 0 {
+		perPodLimit = 0
+	}
+	if len(snapshots) > int(perPodLimit) {
+		toDelete := len(snapshots) - int(perPodLimit)
 		for i := 0; i < toDelete; i++ {
 			if err := r.client.Delete(ctx, &snapshots[i]); err != nil && !apierrors.IsNotFound(err) {
 				return fmt.Errorf("failed to delete snapshot %s: %w", snapshots[i].Name, err)
 			}
 		}
 	}
@@
 	if len(allSnapshots.Items) > int(config.Spec.Snapshots.NamespaceRetentionCount) {
-		sortSnapshotsByCreationTime(allSnapshots.Items)
-		toDelete := len(allSnapshots.Items) - int(config.Spec.Snapshots.NamespaceRetentionCount)
+		nsLimit := config.Spec.Snapshots.NamespaceRetentionCount
+		if nsLimit < 0 {
+			nsLimit = 0
+		}
+		sortSnapshotsByCreationTime(allSnapshots.Items)
+		toDelete := len(allSnapshots.Items) - int(nsLimit)
 		for i := 0; i < toDelete; i++ {
 			if err := r.client.Delete(ctx, &allSnapshots.Items[i]); err != nil && !apierrors.IsNotFound(err) {
 				return fmt.Errorf("failed to delete snapshot %s: %w", allSnapshots.Items[i].Name, err)
 			}
 		}
```
}Combined with CRD-side Minimum=0 validation, this prevents misconfiguration from taking down the snapshot controller.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```go
func (r *SnapshotReconciler) manageRetention(ctx context.Context, pod *corev1.Pod, pvc *corev1.PersistentVolumeClaim, config *auditlogpersistencev1alpha1.AuditLogPersistenceConfig) error {
	// List all snapshots for this PVC
	snapshotList := &snapshotv1.VolumeSnapshotList{}
	if err := r.client.List(ctx, snapshotList, client.InNamespace(pod.Namespace), client.MatchingLabels{
		auditLogsPVCLabelKey: pvc.Name,
	}); err != nil {
		return fmt.Errorf("failed to list snapshots: %w", err)
	}

	// Sort snapshots by creation time (oldest first)
	snapshots := snapshotList.Items
	sortSnapshotsByCreationTime(snapshots)

	// Per-pod retention: delete oldest snapshots if over limit
	perPodLimit := config.Spec.Snapshots.PerPodRetentionCount
	if perPodLimit < 0 {
		perPodLimit = 0
	}
	if len(snapshots) > int(perPodLimit) {
		toDelete := len(snapshots) - int(perPodLimit)
		for i := 0; i < toDelete; i++ {
			if err := r.client.Delete(ctx, &snapshots[i]); err != nil && !apierrors.IsNotFound(err) {
				return fmt.Errorf("failed to delete snapshot %s: %w", snapshots[i].Name, err)
			}
		}
	}

	// Namespace retention: list all snapshots in namespace and delete oldest if over limit
	allSnapshots := &snapshotv1.VolumeSnapshotList{}
	if err := r.client.List(ctx, allSnapshots, client.InNamespace(pod.Namespace), client.MatchingLabels{
		controlPlaneNamespaceLabelKey: pod.Namespace,
	}); err != nil {
		return fmt.Errorf("failed to list all snapshots in namespace: %w", err)
	}

	if len(allSnapshots.Items) > int(config.Spec.Snapshots.NamespaceRetentionCount) {
		nsLimit := config.Spec.Snapshots.NamespaceRetentionCount
		if nsLimit < 0 {
			nsLimit = 0
		}
		sortSnapshotsByCreationTime(allSnapshots.Items)
		toDelete := len(allSnapshots.Items) - int(nsLimit)
		for i := 0; i < toDelete; i++ {
			if err := r.client.Delete(ctx, &allSnapshots.Items[i]); err != nil && !apierrors.IsNotFound(err) {
				return fmt.Errorf("failed to delete snapshot %s: %w", allSnapshots.Items[i].Name, err)
			}
		}
	}
}
```
🤖 Prompt for AI Agents
In hypershift-operator/controllers/auditlogpersistence/snapshot_controller.go
around lines 215 to 254, the code assumes
config.Spec.Snapshots.PerPodRetentionCount and NamespaceRetentionCount are
non-negative which can cause slice index OOB and a panic if they are negative;
clamp both retention counts to a non-negative int before using them (read the
CRD values into local ints, if less than zero set to 0) and then use those
sanitized ints for the length comparisons and toDelete calculations for both
per-pod and namespace retention logic.
What this PR does / why we need it:
Introduces a new audit log persistence feature that enables hosted clusters to store kube-apiserver audit logs on persistent volumes with automatic snapshot capabilities for crash recovery and forensic analysis.
Key components:
The feature is opt-in via the --enable-audit-log-persistence flag during HyperShift operator installation, ensuring backward compatibility and allowing gradual rollout across management clusters.
Architecture:
Commit-Message-Assisted-by: Claude (via Claude Code)
Which issue(s) this PR fixes:
Fixes CNTRLPLANE-1772
Special notes for your reviewer:
Checklist: