Fix TektonInstallerSet deadlock when resources have deletionTimestamp#3217
Fix TektonInstallerSet deadlock when resources have deletionTimestamp#3217jkhelil wants to merge 1 commit intotektoncd:mainfrom
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
b1a9ca5 to
e803a2e
Compare
|
/kind bug |
|
@jkhelil - could you help with steps to reproduce the issue? |
| } | ||
|
|
||
| // Resource is being deleted by another controller/InstallerSet, skip it | ||
| ressourceLogger.Debug("resource is being deleted by another owner, skipping") |
There was a problem hiding this comment.
Would it be ok to log the finalizer name as well and Deletion time stamp here?
to reproduce, install once, have some workload, check finaliers are there, delete tektonconfig, reinstall or do an upgrade |
e803a2e to
2188fed
Compare
|
@jkhelil - Attempted to test this implementation and have the following observation. Installersets especially the pipeline static installerset that manages the CRD (tested this with CRD taskruns.tekton.dev) gets into deadlock, if any given CRD has a finalizer and marked for deletion. With the current PR implementation, we skip CRD reconciliation if the resource does not belong to the new Installerset. However, this behavior prevents the new Installerset version from reconciling and installing the required CRDs. What I observed was the static installersets that reconciles the CRDs does not create new instance of CRD if any of the CRDs are deleted or missing. If we skip this during deadlock, we end up in a situation where PRS/TRs will not be accpeted to the cluster unless the user manually figures out and has this reconciled by forcibly deleting the pipeline static installerset. Hence instead of skipping, could we attempt to remove the finalizers with proper logging so that stale resource is removed and new CRD is created properly? Sample code block generated with Claude for reference: Thank you |
2188fed to
a40b5fe
Compare
|
/hold |
|
@jkhelil - Apologize for being so fussy about this PR. Thank you for the update. On further testing observed the following issue: When multiple CRDs are stuck in terminating state simultaneously (e.g., both pipelineruns.tekton.dev and taskruns.tekton.dev holding the customresourcecleanup.apiextensions.k8s.io finalizer), the current fix clears only one CRD's finalizer per reconcile pass because it did an early return RECONCILE_AGAIN_ERR immediately after the first successful finalizer removal — exiting the loop before reaching subsequent CRDs. This produced the "alternating" behavior observed in logs: Pass N: clears pipelineruns → returns, taskruns untouched Proposed fix derived with help from Claude: Replace the early return with a needsReconcileAgain = true + continue pattern, so all stuck CRDs have their finalizers cleared in a single reconcile pass. The RECONCILE_AGAIN_ERR is returned once after the loop completes. Proposed install.go for reference - https://gist.github.com/anithapriyanatarajan/9b78f1b09645f4adc6aea9547194fdb5 Have also included another minor Sorry again for delaying with this. But I think these fixes will help us completely get out of the upgrade issues due to stale PRs/TRs Steps to capture the edge case:
Observed that only on alternate deletions the UIDs were updated. With that minor enhancement to iterate through all CRDs before existingm this issue was gone. Thankyou |
|
/hold cancel |
Changes
Fixes #2474
The operator enters a deadlock when any resource (e.g., CRD) has a
deletionTimestampduring InstallerSet reconciliation. The current code immediately aborts the entire reconciliation phase, preventing critical namespace-scoped resources (ServiceAccounts, RBAC) from being created.Symptoms:
openshift-pipelinesnamespaceserviceaccount not found)Impact: Complete operator failure during installations, upgrades, downgrades, or recovery operations.
Root Cause
Location:
pkg/reconciler/kubernetes/tektoninstallerset/install.go:166-168Submitter Checklist
These are the criteria that every PR should meet, please check them off as you
review them:
make test lintbefore submitting a PRSee the contribution guide for more details.
Release Notes