
Conversation

@csrwng (Contributor) commented Nov 16, 2025

What this PR does / why we need it:

Introduces a new audit log persistence feature that enables hosted clusters to store kube-apiserver audit logs on persistent volumes with automatic snapshot capabilities for crash recovery and forensic analysis.

Key components:

  • New AuditLogPersistenceConfig CRD for management cluster administrators
  • Mutating webhooks to inject PVCs and volume mounts into kube-apiserver pods
  • Snapshot controller that automatically creates VolumeSnapshots on pod crashes
  • ConfigMap webhook to configure audit log rotation settings
  • Configurable retention policies for snapshot management

The feature is opt-in via the --enable-audit-log-persistence flag during HyperShift operator installation, ensuring backward compatibility and allowing gradual rollout across management clusters.

Architecture:

  • API definitions in api/auditlogpersistence/v1alpha1/
  • Controller and webhook logic in hypershift-operator/controllers/auditlogpersistence/
  • Installation and RBAC changes in cmd/install/
  • Integration with hypershift-operator main.go for webhook registration

Commit-Message-Assisted-by: Claude (via Claude Code)

Which issue(s) this PR fixes:

Fixes CNTRLPLANE-1772

Special notes for your reviewer:

Checklist:

  • Subject and description added to both the commit and the PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

openshift-ci-robot added the jira/valid-reference label on Nov 16, 2025.
openshift-ci-robot commented Nov 16, 2025

@csrwng: This pull request references CNTRLPLANE-1772 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci bot commented Nov 16, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci bot added the do-not-merge/work-in-progress and do-not-merge/needs-area labels on Nov 16, 2025.
coderabbitai bot commented Nov 16, 2025

Walkthrough

The PR introduces a new audit log persistence feature for HyperShift, adding a custom resource definition (CRD) to configure persistent audit logging, mutating webhooks to inject storage and log configuration into kube-apiserver pods and ConfigMaps, and a snapshot controller to manage automatic volume snapshots with retention policies.
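As a rough illustration (not code from the PR), creating the cluster-scoped config from a Go client might look like the sketch below. Spec.Enabled and the Snapshots retention fields are referenced later in this review; the Spec type name, the "cluster" object name, and the client variable are assumptions.

    // Illustrative sketch only: the Spec struct name, the "cluster" object name, and
    // k8sClient are assumptions; Enabled, Snapshots, PerPodRetentionCount, and
    // NamespaceRetentionCount are fields referenced elsewhere in this review.
    cfg := &auditlogpersistencev1alpha1.AuditLogPersistenceConfig{
        ObjectMeta: metav1.ObjectMeta{Name: "cluster"},
        Spec: auditlogpersistencev1alpha1.AuditLogPersistenceConfigSpec{
            Enabled: true,
            Snapshots: auditlogpersistencev1alpha1.SnapshotConfig{
                PerPodRetentionCount:    10,
                NamespaceRetentionCount: 50,
            },
        },
    }
    if err := k8sClient.Create(ctx, cfg); err != nil {
        // handle error
    }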

Changes

Cohort / File(s) — Summary

  • CRD API Types — api/auditlogpersistence/register.go, api/auditlogpersistence/v1alpha1/doc.go, api/auditlogpersistence/v1alpha1/register.go, api/auditlogpersistence/v1alpha1/auditlogpersistenceconfig_types.go
    Introduces the auditlogpersistence API group with the v1alpha1 AuditLogPersistenceConfig type, including nested StorageConfig, AuditLogConfig, and SnapshotConfig specs. Adds codegen annotations and scheme registration helpers.
  • Generated Deepcopy Methods — api/auditlogpersistence/v1alpha1/zz_generated.deepcopy.go
    Auto-generated deepcopy support for all CRD types and nested structures.
  • CRD Manifest — cmd/install/assets/hypershift-operator/auditlogpersistence.hypershift.openshift.io_auditlogpersistenceconfigs.yaml
    Kubernetes CustomResourceDefinition manifest for AuditLogPersistenceConfig with cluster scope, OpenAPI v3 schema validation, defaults, patterns, and validation rules.
  • Build System — Makefile
    Adds a CRD generation step for the auditlogpersistence API paths.
  • Operator Component Configuration — cmd/install/assets/hypershift_operator.go
    Adds an EnableAuditLogPersistence flag to HyperShiftOperatorDeployment, HyperShiftOperatorClusterRole, and HyperShiftMutatingWebhookConfiguration, conditionally wiring RBAC rules, webhooks, and environment variables.
  • Install CLI and Wiring — cmd/install/install.go
    Introduces the --enable-audit-log-persistence CLI flag and propagates EnableAuditLogPersistence through operator deployment, webhook, and RBAC configuration; conditionally includes the auditlogpersistence CRD in setupCRDs.
  • ConfigMap Webhook — hypershift-operator/controllers/auditlogpersistence/configmap_webhook.go, hypershift-operator/controllers/auditlogpersistence/configmap_webhook_test.go
    Mutating webhook handler that intercepts kas-config ConfigMap mutations and injects kube-apiserver audit log settings (maxsize, maxbackup) from AuditLogPersistenceConfig. Includes comprehensive test coverage for JSON mutation scenarios.
  • Pod Webhook — hypershift-operator/controllers/auditlogpersistence/pod_webhook.go, hypershift-operator/controllers/auditlogpersistence/pod_webhook_test.go
    Mutating webhook handler that intercepts kube-apiserver Pod creation, creates or associates a PersistentVolumeClaim (PVC) for logs based on the storage config, and replaces the EmptyDir logs volume with a PVC volume. Includes tests for PVC creation, naming, ownership, and storage configuration.
  • Snapshot Controller — hypershift-operator/controllers/auditlogpersistence/snapshot_controller.go, hypershift-operator/controllers/auditlogpersistence/snapshot_controller_test.go
    Reconciler that watches kube-apiserver Pods, detects container restarts, creates VolumeSnapshots for audit logs, and enforces per-pod and namespace-level retention policies. Includes an extensive test suite for reconciliation, retention logic, and annotation management.
  • Operator Integration — hypershift-operator/main.go, support/api/scheme.go
    Registers Pod and ConfigMap mutating webhooks when audit log persistence is enabled; sets up the snapshot controller; adds auditlogpersistence v1alpha1 to the API scheme.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

  • ConfigMap webhook mutation logic: JSON parsing, nested field manipulation, and edge cases around apiServerArguments initialization require careful validation.
  • Pod webhook lifecycle management: PVC creation, ownership chaining via ReplicaSet, volume replacement logic, and name generation handling need thorough review.
  • Snapshot controller retention policy: Per-pod and namespace-level retention deletion logic, annotation state management, and interval gating are dense and error-prone.
  • Webhook registration and integration: Ensure proper namespace selector scoping, webhook path routing, and conditional enablement across operator components.
  • Error handling and state consistency: Each webhook and controller component has multiple error paths; verify graceful degradation and recovery from corrupted annotations.

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.5.0)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions


Comment @coderabbitai help to get the list of available commands and usage tips.

openshift-ci bot commented Nov 16, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: csrwng

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot added the approved, area/api, area/cli, area/control-plane-operator, and area/hypershift-operator labels and removed the do-not-merge/needs-area label on Nov 16, 2025.
@csrwng (Contributor, Author) commented Nov 16, 2025

@coderabbitai review

coderabbitai bot commented Nov 16, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai bot left a review comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cmd/install/install.go (1)

645-683: Audit log persistence CRD is not actually gated by the feature flag

In setupCRDs, the intention is to include audit log persistence CRDs only when opts.EnableAuditLogPersistence is true:

// Conditionally include auditlogpersistence CRD only if feature is enabled
if strings.Contains(path, "auditlogpersistence") {
    return opts.EnableAuditLogPersistence
}

However, this check currently appears after:

if strings.Contains(path, "hypershift-operator/") {
    return true
}

For the audit log persistence CRD, the path will contain "hypershift-operator/", so the function returns true before it ever reaches the auditlogpersistence condition. As a result, the CRD is always installed regardless of the flag, which contradicts the opt‑in design.

Reordering the checks fixes this:

-            if strings.Contains(path, "hypershift-operator/") {
-                return true
-            }
-            if strings.Contains(path, "cluster-api/") {
-                return true
-            }
-                // Conditionally include auditlogpersistence CRD only if feature is enabled
-                if strings.Contains(path, "auditlogpersistence") {
-                    return opts.EnableAuditLogPersistence
-                }
+            // Conditionally include auditlogpersistence CRDs only if the feature is enabled
+            if strings.Contains(path, "auditlogpersistence") {
+                return opts.EnableAuditLogPersistence
+            }
+            if strings.Contains(path, "hypershift-operator/") {
+                return true
+            }
+            if strings.Contains(path, "cluster-api/") {
+                return true
+            }

This ensures that the new CRD truly respects the feature flag.
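For clarity, here is the reordered filter logic in isolation; the function name and signature are simplified assumptions, and only the check ordering mirrors the suggestion above.

    package main

    import "strings"

    // includeCRD is a simplified stand-in for the setupCRDs filter: the feature-gated
    // path check must run before the broad "hypershift-operator/" prefix match,
    // otherwise the prefix match returns true first and the flag is never consulted.
    func includeCRD(path string, enableAuditLogPersistence bool) bool {
        if strings.Contains(path, "auditlogpersistence") {
            return enableAuditLogPersistence
        }
        if strings.Contains(path, "hypershift-operator/") {
            return true
        }
        if strings.Contains(path, "cluster-api/") {
            return true
        }
        return false
    }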

🧹 Nitpick comments (4)
cmd/install/install.go (1)

558-563: Webhook, deployment, and RBAC plumbing for audit log persistence is coherent

  • The mutating webhook configuration is now created when either defaulting webhooks or audit log persistence are enabled, and the template receives EnableAuditLogPersistence to control which webhooks are actually configured.
  • HyperShiftOperatorDeployment.EnableWebhook now accounts for the new feature, ensuring the operator exposes a webhook server when only audit log persistence is enabled.
  • HyperShiftOperatorDeployment.EnableAuditLogPersistence and HyperShiftOperatorClusterRole.EnableAuditLogPersistence give the assets layer enough context to inject the correct env vars and RBAC for the snapshot controller and webhooks.

One follow‑up you may want to consider: updating ApplyDefaults to set HyperShiftOperatorReplicas = 2 when only EnableAuditLogPersistence is true, to keep webhook availability semantics consistent with other webhook‑driven features.

Example refinement:

 func (o *Options) ApplyDefaults() {
   o.RenderNamespace = true
   switch {
   case o.Development:
     o.HyperShiftOperatorReplicas = 0
-  case o.EnableDefaultingWebhook || o.EnableConversionWebhook || o.EnableValidatingWebhook:
+  case o.EnableDefaultingWebhook || o.EnableConversionWebhook || o.EnableValidatingWebhook || o.EnableAuditLogPersistence:
     o.HyperShiftOperatorReplicas = 2
   default:
     o.HyperShiftOperatorReplicas = 1
   }
 }

Also applies to: 789-799, 826-827, 967-977

hypershift-operator/controllers/auditlogpersistence/configmap_webhook_test.go (1)

80-104: Tighten assertion for early-return path (no config.json key)

In the “ConfigMap without config.json key returns early” case you only assert cm.Data[kubeAPIServerConfigKey] is empty. Because a missing key also reads as "", this would still pass if mutateConfigMap started writing an empty config.json entry.

To make the test guard the early-return behavior more strictly, consider also (or instead) asserting that the key does not exist:

configJSON, exists := cm.Data[kubeAPIServerConfigKey]
g.Expect(exists).To(BeFalse())
g.Expect(configJSON).To(Equal(""))

This keeps the intent clear: the webhook should not touch the key at all when it’s absent.

cmd/install/assets/hypershift_operator.go (1)

1007-1042: Audit-log RBAC block partially redundant with existing ClusterRole rules

Within HyperShiftOperatorClusterRole.Build, the conditional RBAC block for audit log persistence adds rules for PVCs, pods, and configmaps, but the earlier core-group rule already grants rbacv1.VerbAll on those resources:

{
    APIGroups: []string{""},
    Resources: []string{
        "events", "configmaps", "configmaps/finalizers",
        "persistentvolumeclaims", "pods", "pods/log", ...
    },
    Verbs: []string{rbacv1.VerbAll},
}

So the new core rules under if o.EnableAuditLogPersistence { ... } don’t change effective privileges and mainly add noise. The only materially new capability is list/watch for snapshot.storage.k8s.io/volumesnapshots, which is meaningful on top of the existing get/create/delete rule.

Consider trimming the conditional block to just the volumesnapshot rule (or, if you intend to truly gate core permissions by this feature, moving the PVC/pod/configmap access out of the broad VerbAll rule into feature-specific blocks—but that’s a larger behavior change).

Also applies to: 1310-1338
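If the block were trimmed as suggested, it might reduce to something like the sketch below. The clusterRole variable name and the exact verb set are assumptions; as noted above, list/watch on volumesnapshots is the genuinely new capability.

    if o.EnableAuditLogPersistence {
        clusterRole.Rules = append(clusterRole.Rules, rbacv1.PolicyRule{
            APIGroups: []string{"snapshot.storage.k8s.io"},
            Resources: []string{"volumesnapshots"},
            Verbs:     []string{"get", "list", "watch", "create", "delete"},
        })
    }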

hypershift-operator/controllers/auditlogpersistence/configmap_webhook.go (1)

43-110: Refactor is optional but recommended for robustness; current code is functionally correct

The web search confirms that AdmissionRequest.name is populated when the client provides explicit metadata.name, but will be empty if the client omits metadata.name (e.g., uses generateName). However, analysis of the codebase shows that kas-config is a system-managed ConfigMap always created with explicit metadata.name in all fixtures and tests—never via generateName. This means the current check on req.Name is functionally correct for this use case.

That said, the suggested refactor to check configMap.Name after decoding remains a worthwhile defensive improvement. It makes the handler agnostic to how AdmissionRequest populates the name field and is more maintainable should the creation pattern change.
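A sketch of the defensive variant, assuming a standard controller-runtime admission handler (the receiver type, decoder field, and the literal ConfigMap name are assumptions):

    func (h *configMapMutator) Handle(ctx context.Context, req admission.Request) admission.Response {
        configMap := &corev1.ConfigMap{}
        if err := h.decoder.Decode(req, configMap); err != nil {
            return admission.Errored(http.StatusBadRequest, err)
        }
        // Gate on the decoded object's name rather than req.Name, which can be empty
        // when clients rely on generateName.
        if configMap.Name != "kas-config" {
            return admission.Allowed("not the kas-config ConfigMap")
        }
        // ... mutate audit log settings and return a patch response ...
        return admission.Allowed("")
    }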

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between e83d907 and 29879fa.

⛔ Files ignored due to path filters (6)
  • vendor/github.com/openshift/hypershift/api/auditlogpersistence/register.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/hypershift/api/auditlogpersistence/v1alpha1/auditlogpersistenceconfig_types.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/hypershift/api/auditlogpersistence/v1alpha1/doc.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/hypershift/api/auditlogpersistence/v1alpha1/register.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/openshift/hypershift/api/auditlogpersistence/v1alpha1/zz_generated.deepcopy.go is excluded by !vendor/**, !**/vendor/**
  • vendor/modules.txt is excluded by !vendor/**, !**/vendor/**
📒 Files selected for processing (17)
  • Makefile (1 hunks)
  • api/auditlogpersistence/register.go (1 hunks)
  • api/auditlogpersistence/v1alpha1/auditlogpersistenceconfig_types.go (1 hunks)
  • api/auditlogpersistence/v1alpha1/doc.go (1 hunks)
  • api/auditlogpersistence/v1alpha1/register.go (1 hunks)
  • api/auditlogpersistence/v1alpha1/zz_generated.deepcopy.go (1 hunks)
  • cmd/install/assets/hypershift-operator/auditlogpersistence.hypershift.openshift.io_auditlogpersistenceconfigs.yaml (1 hunks)
  • cmd/install/assets/hypershift_operator.go (7 hunks)
  • cmd/install/install.go (7 hunks)
  • hypershift-operator/controllers/auditlogpersistence/configmap_webhook.go (1 hunks)
  • hypershift-operator/controllers/auditlogpersistence/configmap_webhook_test.go (1 hunks)
  • hypershift-operator/controllers/auditlogpersistence/pod_webhook.go (1 hunks)
  • hypershift-operator/controllers/auditlogpersistence/pod_webhook_test.go (1 hunks)
  • hypershift-operator/controllers/auditlogpersistence/snapshot_controller.go (1 hunks)
  • hypershift-operator/controllers/auditlogpersistence/snapshot_controller_test.go (1 hunks)
  • hypershift-operator/main.go (3 hunks)
  • support/api/scheme.go (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

- Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • Makefile
  • api/auditlogpersistence/register.go
  • cmd/install/assets/hypershift-operator/auditlogpersistence.hypershift.openshift.io_auditlogpersistenceconfigs.yaml
  • api/auditlogpersistence/v1alpha1/register.go
  • hypershift-operator/controllers/auditlogpersistence/snapshot_controller_test.go
  • support/api/scheme.go
  • hypershift-operator/controllers/auditlogpersistence/configmap_webhook_test.go
  • hypershift-operator/controllers/auditlogpersistence/pod_webhook.go
  • hypershift-operator/controllers/auditlogpersistence/configmap_webhook.go
  • cmd/install/install.go
  • hypershift-operator/main.go
  • api/auditlogpersistence/v1alpha1/zz_generated.deepcopy.go
  • api/auditlogpersistence/v1alpha1/doc.go
  • api/auditlogpersistence/v1alpha1/auditlogpersistenceconfig_types.go
  • cmd/install/assets/hypershift_operator.go
  • hypershift-operator/controllers/auditlogpersistence/snapshot_controller.go
  • hypershift-operator/controllers/auditlogpersistence/pod_webhook_test.go
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Red Hat Konflux / control-plane-operator-main-on-pull-request
  • GitHub Check: Red Hat Konflux / hypershift-operator-main-on-pull-request
  • GitHub Check: Red Hat Konflux / hypershift-release-mce-211-on-pull-request
  • GitHub Check: Red Hat Konflux / hypershift-cli-mce-211-on-pull-request
🔇 Additional comments (12)
api/auditlogpersistence/register.go (1)

1-3: Group name definition looks correct and consistent

GroupName matches the v1alpha1 package annotations and registration usage; this provides a clean anchor for the new API group.

Makefile (1)

191-195: CRD generation wiring for auditlogpersistence is consistent

Adding ./api/auditlogpersistence/... to the CRD generation step mirrors existing patterns (scheduling, certificates) and ensures assets stay in sync with the API types.

support/api/scheme.go (1)

6-7: Scheme registration for audit log persistence API is correctly wired

Importing auditlogpersistencev1alpha1 and calling AddToScheme alongside other HyperShift APIs is consistent with the existing scheme setup and enables the new CRD types to be used by clients and controllers.

Also applies to: 88-107
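The wiring being described is the usual scheme-builder pattern, roughly (variable names assumed):

    var Scheme = runtime.NewScheme()

    func init() {
        // Register the audit log persistence types alongside the other HyperShift APIs.
        utilruntime.Must(auditlogpersistencev1alpha1.AddToScheme(Scheme))
    }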

hypershift-operator/main.go (1)

28-29: Audit log persistence webhooks and controller are wired in coherently

The new imports and the ENABLE_AUDIT_LOG_PERSISTENCE‑gated block correctly:

  • Register pod and ConfigMap mutating webhooks only when TLS certs are configured (non‑empty CertDir), and
  • Always wire the snapshot controller when the feature is enabled.

This keeps the feature optional and avoids impacting clusters where the env var is unset, while reusing the manager’s scheme, logger, and client in a standard way.

Also applies to: 75-76, 547-580
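Roughly, the gating described above follows this shape (the env var name comes from the review; the webhook paths, handler variables, and controller setup helper are assumptions):

    if os.Getenv("ENABLE_AUDIT_LOG_PERSISTENCE") == "true" {
        // Webhooks need serving certificates, so register them only when a cert dir is set.
        if certDir != "" {
            hookServer := mgr.GetWebhookServer()
            hookServer.Register("/mutate-audit-pod", &webhook.Admission{Handler: podMutator})
            hookServer.Register("/mutate-audit-configmap", &webhook.Admission{Handler: configMapMutator})
        }
        // The snapshot controller is wired regardless of webhook certs.
        if err := setupSnapshotController(mgr); err != nil {
            return fmt.Errorf("failed to set up snapshot controller: %w", err)
        }
    }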

api/auditlogpersistence/v1alpha1/doc.go (1)

1-4: Codegen annotations for v1alpha1 are appropriate

The deepcopy, groupName, and openapi annotations are standard for CRD API packages and align with the defined group.

api/auditlogpersistence/v1alpha1/register.go (1)

1-37: v1alpha1 scheme registration follows Kubernetes conventions

SchemeGroupVersion, Kind/Resource helpers, and addKnownTypes correctly register AuditLogPersistenceConfig and its list type into the scheme with the expected group/version.
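For reference, the convention typically looks like this; it is the standard pattern, not a copy of the PR's file:

    var (
        SchemeGroupVersion = schema.GroupVersion{Group: auditlogpersistence.GroupName, Version: "v1alpha1"}
        SchemeBuilder      = runtime.NewSchemeBuilder(addKnownTypes)
        AddToScheme        = SchemeBuilder.AddToScheme
    )

    func addKnownTypes(scheme *runtime.Scheme) error {
        scheme.AddKnownTypes(SchemeGroupVersion,
            &AuditLogPersistenceConfig{},
            &AuditLogPersistenceConfigList{},
        )
        metav1.AddToGroupVersion(scheme, SchemeGroupVersion)
        return nil
    }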

cmd/install/install.go (1)

77-83: New install option and flag are threaded cleanly

Introducing EnableAuditLogPersistence on Options and exposing it via --enable-audit-log-persistence integrates well with the existing install flow, keeping the feature opt‑in from the CLI down into manifest generation.

Also applies to: 132-133, 281-282
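The flag threading usually boils down to a single cobra/pflag registration like this (the flag name is from the PR; the help text and variable names are assumptions):

    cmd.Flags().BoolVar(&opts.EnableAuditLogPersistence, "enable-audit-log-persistence", false,
        "Enable audit log persistence (CRD, mutating webhooks, and snapshot controller)")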

cmd/install/assets/hypershift-operator/auditlogpersistence.hypershift.openshift.io_auditlogpersistenceconfigs.yaml (1)

1-203: CRD schema, defaults, and validation look consistent

The CRD shape (cluster-scoped, single cluster instance, sane defaults for auditLog/snapshots/storage, and regexes for duration/quantity) aligns with the described feature and shouldn’t pose compatibility issues.

hypershift-operator/controllers/auditlogpersistence/snapshot_controller_test.go (1)

54-1276: Comprehensive snapshot controller test coverage looks solid

The table-driven tests around SnapshotReconciler (reconcile paths, retention management, timestamp sorting, integer parsing, pod classification, and snapshot creation) give broad coverage of success and edge conditions, and the helper functions make side-effect checks clear and reusable. No major issues stand out here.

hypershift-operator/controllers/auditlogpersistence/pod_webhook_test.go (1)

21-561: Comprehensive mutatePod coverage looks solid

The table-driven tests exercise all key mutatePod branches (name vs generateName, missing names, ReplicaSet ownership, existing PVC, storage class handling, and volume wiring), using a fake client + scheme correctly. This gives good confidence in the PVC and volume mutation behavior.
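The fake-client setup pattern such tests typically use looks roughly like this (object and scheme names assumed):

    scheme := runtime.NewScheme()
    _ = corev1.AddToScheme(scheme)
    _ = appsv1.AddToScheme(scheme) // for the ReplicaSet ownership cases
    c := fake.NewClientBuilder().
        WithScheme(scheme).
        WithObjects(existingPVC). // e.g. the "existing PVC" scenario
        Build()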

hypershift-operator/controllers/auditlogpersistence/pod_webhook.go (1)

49-127: Admission handler + mutation flow look correct and well-scoped

The webhook correctly narrows scope to kube-apiserver Pods in hosted control-plane namespaces, checks the cluster-scoped AuditLogPersistenceConfig and Spec.Enabled, deep-copies the Pod before mutation, and returns a patch response. Combined with mutatePod, the PVC creation and logs volume rewiring for kube-apiserver Pods are implemented cleanly without obvious correctness or security pitfalls.
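The deep-copy-then-patch flow described above generally follows this controller-runtime pattern; mutatePod's signature and the config lookup are assumptions, while admission.PatchResponseFromRaw is the standard helper:

    mutated := pod.DeepCopy()
    if err := mutatePod(ctx, h.client, mutated, config); err != nil { // assumed signature
        return admission.Errored(http.StatusInternalServerError, err)
    }
    mutatedJSON, err := json.Marshal(mutated)
    if err != nil {
        return admission.Errored(http.StatusInternalServerError, err)
    }
    return admission.PatchResponseFromRaw(req.Object.Raw, mutatedJSON)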

api/auditlogpersistence/v1alpha1/zz_generated.deepcopy.go (1)

1-171: Generated deepcopy code is consistent with the v1alpha1 types

This file appears to be standard controller-gen output for the new audit log persistence types, with correct handling of nested structs and slices. No changes needed here.

Comment on lines +83 to +91
// PerPodRetentionCount is the maximum number of snapshots to retain per PVC.
// When this limit is reached, the oldest snapshot for that PVC will be deleted.
// +kubebuilder:default=10
PerPodRetentionCount int32 `json:"perPodRetentionCount,omitempty"`

// NamespaceRetentionCount is the maximum total number of snapshots to retain per namespace.
// When this limit is reached, the oldest snapshot in the namespace will be deleted.
// +kubebuilder:default=50
NamespaceRetentionCount int32 `json:"namespaceRetentionCount,omitempty"`

🛠️ Refactor suggestion | 🟠 Major

Enforce non-negative snapshot retention counts at the API level

PerPodRetentionCount and NamespaceRetentionCount are currently unbounded int32 values with defaults but no minimum validation. In manageRetention (Line 215 onwards in snapshot_controller.go), negative values will lead to toDelete exceeding the slice length and cause a panic when indexing into the snapshots slice.

To prevent invalid configs from ever reaching the controller, add Minimum=0 validation on both fields:

 	// PerPodRetentionCount is the maximum number of snapshots to retain per PVC.
 	// When this limit is reached, the oldest snapshot for that PVC will be deleted.
 	// +kubebuilder:default=10
+	// +kubebuilder:validation:Minimum=0
 	PerPodRetentionCount int32 `json:"perPodRetentionCount,omitempty"`

 	// NamespaceRetentionCount is the maximum total number of snapshots to retain per namespace.
 	// When this limit is reached, the oldest snapshot in the namespace will be deleted.
 	// +kubebuilder:default=50
+	// +kubebuilder:validation:Minimum=0
 	NamespaceRetentionCount int32 `json:"namespaceRetentionCount,omitempty"`

This keeps the CRD self-defensive and aligns with the expectations in the controller logic.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change (resulting code):

    // PerPodRetentionCount is the maximum number of snapshots to retain per PVC.
    // When this limit is reached, the oldest snapshot for that PVC will be deleted.
    // +kubebuilder:default=10
    // +kubebuilder:validation:Minimum=0
    PerPodRetentionCount int32 `json:"perPodRetentionCount,omitempty"`

    // NamespaceRetentionCount is the maximum total number of snapshots to retain per namespace.
    // When this limit is reached, the oldest snapshot in the namespace will be deleted.
    // +kubebuilder:default=50
    // +kubebuilder:validation:Minimum=0
    NamespaceRetentionCount int32 `json:"namespaceRetentionCount,omitempty"`
🤖 Prompt for AI Agents
In api/auditlogpersistence/v1alpha1/auditlogpersistenceconfig_types.go around
lines 83 to 91, the PerPodRetentionCount and NamespaceRetentionCount fields lack
minimum validation which allows negative values and can cause panics in the
controller; add kubebuilder validation tags to enforce a non-negative minimum
(Minimum=0) on both fields so the generated CRD rejects negative values and
aligns with controller expectations.

Comment on lines +215 to +254
func (r *SnapshotReconciler) manageRetention(ctx context.Context, pod *corev1.Pod, pvc *corev1.PersistentVolumeClaim, config *auditlogpersistencev1alpha1.AuditLogPersistenceConfig) error {
    // List all snapshots for this PVC
    snapshotList := &snapshotv1.VolumeSnapshotList{}
    if err := r.client.List(ctx, snapshotList, client.InNamespace(pod.Namespace), client.MatchingLabels{
        auditLogsPVCLabelKey: pvc.Name,
    }); err != nil {
        return fmt.Errorf("failed to list snapshots: %w", err)
    }

    // Sort snapshots by creation time (oldest first)
    snapshots := snapshotList.Items
    sortSnapshotsByCreationTime(snapshots)

    // Per-pod retention: delete oldest snapshots if over limit
    if len(snapshots) > int(config.Spec.Snapshots.PerPodRetentionCount) {
        toDelete := len(snapshots) - int(config.Spec.Snapshots.PerPodRetentionCount)
        for i := 0; i < toDelete; i++ {
            if err := r.client.Delete(ctx, &snapshots[i]); err != nil && !apierrors.IsNotFound(err) {
                return fmt.Errorf("failed to delete snapshot %s: %w", snapshots[i].Name, err)
            }
        }
    }

    // Namespace retention: list all snapshots in namespace and delete oldest if over limit
    allSnapshots := &snapshotv1.VolumeSnapshotList{}
    if err := r.client.List(ctx, allSnapshots, client.InNamespace(pod.Namespace), client.MatchingLabels{
        controlPlaneNamespaceLabelKey: pod.Namespace,
    }); err != nil {
        return fmt.Errorf("failed to list all snapshots in namespace: %w", err)
    }

    if len(allSnapshots.Items) > int(config.Spec.Snapshots.NamespaceRetentionCount) {
        sortSnapshotsByCreationTime(allSnapshots.Items)
        toDelete := len(allSnapshots.Items) - int(config.Spec.Snapshots.NamespaceRetentionCount)
        for i := 0; i < toDelete; i++ {
            if err := r.client.Delete(ctx, &allSnapshots.Items[i]); err != nil && !apierrors.IsNotFound(err) {
                return fmt.Errorf("failed to delete snapshot %s: %w", allSnapshots.Items[i].Name, err)
            }
        }
    }
⚠️ Potential issue | 🟠 Major

Defensively handle negative retention counts to avoid panics

manageRetention assumes config.Spec.Snapshots.PerPodRetentionCount and NamespaceRetentionCount are non-negative. If either is negative (which the current CRD allows), toDelete becomes larger than the number of snapshots and the for loop will index past the end of the slice, causing a controller panic.

Consider clamping negative retention values to zero before using them in arithmetic:

 	// Per-pod retention: delete oldest snapshots if over limit
-	if len(snapshots) > int(config.Spec.Snapshots.PerPodRetentionCount) {
-		toDelete := len(snapshots) - int(config.Spec.Snapshots.PerPodRetentionCount)
+	perPodLimit := config.Spec.Snapshots.PerPodRetentionCount
+	if perPodLimit < 0 {
+		perPodLimit = 0
+	}
+	if len(snapshots) > int(perPodLimit) {
+		toDelete := len(snapshots) - int(perPodLimit)
 		for i := 0; i < toDelete; i++ {
 			if err := r.client.Delete(ctx, &snapshots[i]); err != nil && !apierrors.IsNotFound(err) {
 				return fmt.Errorf("failed to delete snapshot %s: %w", snapshots[i].Name, err)
 			}
 		}
 	}
@@
 	if len(allSnapshots.Items) > int(config.Spec.Snapshots.NamespaceRetentionCount) {
-		sortSnapshotsByCreationTime(allSnapshots.Items)
-		toDelete := len(allSnapshots.Items) - int(config.Spec.Snapshots.NamespaceRetentionCount)
+		nsLimit := config.Spec.Snapshots.NamespaceRetentionCount
+		if nsLimit < 0 {
+			nsLimit = 0
+		}
+		sortSnapshotsByCreationTime(allSnapshots.Items)
+		toDelete := len(allSnapshots.Items) - int(nsLimit)
 		for i := 0; i < toDelete; i++ {
 			if err := r.client.Delete(ctx, &allSnapshots.Items[i]); err != nil && !apierrors.IsNotFound(err) {
 				return fmt.Errorf("failed to delete snapshot %s: %w", allSnapshots.Items[i].Name, err)
 			}
 		}

Combined with CRD-side Minimum=0 validation, this prevents misconfiguration from taking down the snapshot controller.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change (resulting code):

func (r *SnapshotReconciler) manageRetention(ctx context.Context, pod *corev1.Pod, pvc *corev1.PersistentVolumeClaim, config *auditlogpersistencev1alpha1.AuditLogPersistenceConfig) error {
    // List all snapshots for this PVC
    snapshotList := &snapshotv1.VolumeSnapshotList{}
    if err := r.client.List(ctx, snapshotList, client.InNamespace(pod.Namespace), client.MatchingLabels{
        auditLogsPVCLabelKey: pvc.Name,
    }); err != nil {
        return fmt.Errorf("failed to list snapshots: %w", err)
    }

    // Sort snapshots by creation time (oldest first)
    snapshots := snapshotList.Items
    sortSnapshotsByCreationTime(snapshots)

    // Per-pod retention: delete oldest snapshots if over limit
    perPodLimit := config.Spec.Snapshots.PerPodRetentionCount
    if perPodLimit < 0 {
        perPodLimit = 0
    }
    if len(snapshots) > int(perPodLimit) {
        toDelete := len(snapshots) - int(perPodLimit)
        for i := 0; i < toDelete; i++ {
            if err := r.client.Delete(ctx, &snapshots[i]); err != nil && !apierrors.IsNotFound(err) {
                return fmt.Errorf("failed to delete snapshot %s: %w", snapshots[i].Name, err)
            }
        }
    }

    // Namespace retention: list all snapshots in namespace and delete oldest if over limit
    allSnapshots := &snapshotv1.VolumeSnapshotList{}
    if err := r.client.List(ctx, allSnapshots, client.InNamespace(pod.Namespace), client.MatchingLabels{
        controlPlaneNamespaceLabelKey: pod.Namespace,
    }); err != nil {
        return fmt.Errorf("failed to list all snapshots in namespace: %w", err)
    }

    if len(allSnapshots.Items) > int(config.Spec.Snapshots.NamespaceRetentionCount) {
        nsLimit := config.Spec.Snapshots.NamespaceRetentionCount
        if nsLimit < 0 {
            nsLimit = 0
        }
        sortSnapshotsByCreationTime(allSnapshots.Items)
        toDelete := len(allSnapshots.Items) - int(nsLimit)
        for i := 0; i < toDelete; i++ {
            if err := r.client.Delete(ctx, &allSnapshots.Items[i]); err != nil && !apierrors.IsNotFound(err) {
                return fmt.Errorf("failed to delete snapshot %s: %w", allSnapshots.Items[i].Name, err)
            }
        }
    }
🤖 Prompt for AI Agents
In hypershift-operator/controllers/auditlogpersistence/snapshot_controller.go
around lines 215 to 254, the code assumes
config.Spec.Snapshots.PerPodRetentionCount and NamespaceRetentionCount are
non-negative which can cause slice index OOB and a panic if they are negative;
clamp both retention counts to a non-negative int before using them (read the
CRD values into local ints, if less than zero set to 0) and then use those
sanitized ints for the length comparisons and toDelete calculations for both
per-pod and namespace retention logic.
