Fix TektonInstallerSet deadlock when resources have deletionTimestamp#3217

Open

jkhelil wants to merge 1 commit into tektoncd:main from jkhelil:fix_deadlock

Conversation

@jkhelil
Member

@jkhelil jkhelil commented Feb 14, 2026

Changes

  • Added installerSetName parameter to ensureResources() and all callers
  • Checks OwnerReferences to determine if resource belongs to this InstallerSet
  • Only waits for owned resources, skips others
  • Allows reconciliation to continue even with TERMINATING CRDs

Fixes #2474

The operator enters a deadlock when any resource (e.g., CRD) has a deletionTimestamp during InstallerSet reconciliation. The current code immediately aborts the entire reconciliation phase, preventing critical namespace-scoped resources (ServiceAccounts, RBAC) from being created.

Symptoms:

  • No ServiceAccounts created in openshift-pipelines namespace
  • InstallerSets stuck with "reconcile again and proceed"
  • No component pods running (Deployments fail: serviceaccount not found)
  • Webhooks unavailable → TektonConfig can't reconcile
  • Operator logs show infinite CRD fetching loop

Impact: Complete operator failure during installations, upgrades, downgrades, or recovery operations.

Root Cause

Location: pkg/reconciler/kubernetes/tektoninstallerset/install.go:166-168
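The change described in the bullets above can be sketched roughly as follows. This is a simplified, hypothetical reconstruction, not the PR's actual install.go code: the local `ownerReference` and `resource` types stand in for the Kubernetes `metav1.OwnerReference` and unstructured object types, and `isOwnedBy`/`ensureResource` are illustrative helper names.

```go
package main

import "fmt"

// ownerReference mirrors the metav1.OwnerReference fields that matter here.
type ownerReference struct {
	Kind string
	Name string
}

// resource is a minimal stand-in for an unstructured Kubernetes object.
type resource struct {
	Name              string
	DeletionTimestamp string // non-empty means the resource is terminating
	OwnerReferences   []ownerReference
}

// isOwnedBy reports whether the resource lists the given TektonInstallerSet
// among its owner references.
func isOwnedBy(res resource, installerSetName string) bool {
	for _, ref := range res.OwnerReferences {
		if ref.Kind == "TektonInstallerSet" && ref.Name == installerSetName {
			return true
		}
	}
	return false
}

// ensureResource sketches the decision: only wait (reconcile again) for a
// terminating resource owned by this InstallerSet; skip resources owned by
// others so reconciliation can proceed to ServiceAccounts, RBAC, etc.
func ensureResource(res resource, installerSetName string) string {
	if res.DeletionTimestamp != "" {
		if isOwnedBy(res, installerSetName) {
			return "reconcile-again" // our own resource is going away; wait for it
		}
		return "skip" // owned by another InstallerSet; do not deadlock on it
	}
	return "ensure"
}

func main() {
	terminating := resource{
		Name:              "taskruns.tekton.dev",
		DeletionTimestamp: "2026-02-14T00:00:00Z",
		OwnerReferences:   []ownerReference{{Kind: "TektonInstallerSet", Name: "old-set"}},
	}
	fmt.Println(ensureResource(terminating, "new-set")) // skip
	fmt.Println(ensureResource(terminating, "old-set")) // reconcile-again
}
```

Under this sketch, the old behavior corresponds to always returning "reconcile-again" whenever a deletionTimestamp is present, which is what caused the infinite loop.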

Submitter Checklist

These are the criteria that every PR should meet, please check them off as you
review them:

See the contribution guide for more details.

Release Notes

NONE

@tekton-robot tekton-robot added the release-note-none Denotes a PR that doesn't merit a release note. label Feb 14, 2026
@tekton-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please ask for approval from jkhelil after the PR has been reviewed.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Feb 14, 2026
@jkhelil
Member Author

jkhelil commented Feb 17, 2026

@vdemeester @anithapriyanatarajan PTAL

@jkhelil jkhelil closed this Feb 17, 2026
@jkhelil jkhelil reopened this Feb 17, 2026
@anithapriyanatarajan
Contributor

/kind bug

@tekton-robot tekton-robot added the kind/bug Categorizes issue or PR as related to a bug. label Feb 17, 2026
@anithapriyanatarajan
Contributor

@jkhelil - could you help with steps to reproduce the issue?

}

// Resource is being deleted by another controller/InstallerSet, skip it
ressourceLogger.Debug("resource is being deleted by another owner, skipping")
Contributor

Would it be OK to log the finalizer name and the deletion timestamp here as well?

@jkhelil
Member Author

jkhelil commented Feb 18, 2026

@jkhelil - could you help with steps to reproduce the issue?
It’s a somewhat strange error that we’ve been carrying across several versions; you may remember all the issues people encountered during upgrades.

What happens:
Some TaskRuns or PipelineRuns have finalizers. A user tries to delete TektonConfig to uninstall the operator, either because they notice a problem during an upgrade or because their teams are experiencing etcd or infrastructure issues.
When TektonConfig is deleted, the CRDs are also removed, but they get stuck due to the finalizers on the TaskRuns. The user doesn’t see this because they delete TektonConfig and don’t monitor the remaining resources; TektonConfig appears deleted.
The user then tries to reinstall or upgrade using a new CSV, but it doesn’t work. The installer set is stuck in a loop: it checks whether there is a deletion timestamp on the CRDs and keeps reconciling again, never proceeding to the next step (installing the service accounts, RBAC, etc.).
The user ends up seeing an error related to the webhook service account, but in reality there is nothing wrong with it; the installer set is simply stuck reconciling continuously.

To reproduce: install once, run some workloads, check that the finalizers are present, delete TektonConfig, then reinstall or upgrade.

@anithapriyanatarajan
Contributor

anithapriyanatarajan commented Feb 24, 2026

@jkhelil - Attempted to test this implementation and have the following observations.

InstallerSets, especially the pipeline static installerset that manages the CRDs (tested with the taskruns.tekton.dev CRD), get into a deadlock if any given CRD has a finalizer and is marked for deletion.

With the current PR implementation, we skip CRD reconciliation if the resource does not belong to the new installerset. However, this behavior prevents the new installerset version from reconciling and installing the required CRDs.

What I observed is that the static installerset that reconciles the CRDs does not create a new instance of a CRD if any of the CRDs are deleted or missing. If we skip them during the deadlock, we end up in a situation where PipelineRuns/TaskRuns will not be accepted by the cluster unless the user manually figures this out and forces a reconcile by deleting the pipeline static installerset.

Hence, instead of skipping, could we attempt to remove the finalizers with proper logging, so that the stale resource is removed and the new CRD is created properly? Sample code block generated with Claude for reference:

    ressourceLogger.Warn("force deleting stale resource to break deadlock",
        "deletionTimestamp", res.GetDeletionTimestamp(),
        "finalizers", res.GetFinalizers(),
        "oldOwners", res.GetOwnerReferences(),
        "reason", "resource not owned by current InstallerSet") 

    // Remove finalizers to complete deletion
    res.SetFinalizers([]string{})
    if err := i.mfClient.Update(res); err != nil {
        ressourceLogger.Error("failed to remove finalizers", "error", err)
        return err
    }
    
    // Reconcile again to create new resource
    return v1alpha1.RECONCILE_AGAIN_ERR    

Thank you

@tekton-robot tekton-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 25, 2026
@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 5, 2026
@anithapriyanatarajan
Contributor

/hold

@tekton-robot tekton-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 5, 2026
@anithapriyanatarajan
Contributor

@jkhelil - Apologies for being so fussy about this PR. Thank you for the update. Further testing surfaced the following issue:

When multiple CRDs are stuck in the terminating state simultaneously (e.g., both pipelineruns.tekton.dev and taskruns.tekton.dev holding the customresourcecleanup.apiextensions.k8s.io finalizer), the current fix clears only one CRD's finalizer per reconcile pass: it returns RECONCILE_AGAIN_ERR immediately after the first successful finalizer removal, exiting the loop before reaching the subsequent CRDs.

This produced the "alternating" behavior observed in logs:

Pass N: clears pipelineruns → returns, taskruns untouched
Pass N+1: clears taskruns → returns, pipelineruns re-checked
...and so on, doubling the time to recovery for every additional stuck CRD

Proposed fix derived with help from Claude: replace the early return with a needsReconcileAgain = true + continue pattern, so that all stuck CRDs have their finalizers cleared in a single reconcile pass. RECONCILE_AGAIN_ERR is returned once after the loop completes. Proposed install.go for reference: https://gist.github.com/anithapriyanatarajan/9b78f1b09645f4adc6aea9547194fdb5

I have also included another minor return nil fix that stops the spam of "failed to get resource: error=null" log entries. Though unrelated to this PR, the fix is tiny and keeps the logs clean.

Sorry again for the delay. I think these fixes will help us completely get past the upgrade issues caused by stale PipelineRuns/TaskRuns.

Steps to capture the edge case:

  • make apply the PR changes on a kind cluster
  • Once all the pods were running in the tekton-pipelines namespace, created simple PipelineRuns and allowed them to succeed
  • Added a random finalizer finalizer.tekton.dev to the PipelineRun
  • Executed kubectl delete tektonpipelines pipeline and watched whether the ownerReference UIDs changed properly to the new pipeline CR's UID, using the following commands
  • Watched updates of the ownerReference on PipelineRuns with kubectl get crd pipelineruns.tekton.dev -w -o json | jq '. | {name: .metadata.name, ownerUID: .metadata.ownerReferences[0].uid, resourceVersion: .metadata.resourceVersion, generation: .metadata.generation}'
  • Watched updates of the ownerReference on TaskRuns with kubectl get crd taskruns.tekton.dev -w -o json | jq '. | {name: .metadata.name, ownerUID: .metadata.ownerReferences[0].uid, resourceVersion: .metadata.resourceVersion, generation: .metadata.generation}'

Observed that only on alternate deletions were the UIDs updated. With the minor enhancement to iterate through all CRDs before exiting, this issue was gone. Thank you.

@anithapriyanatarajan
Contributor

/hold cancel

@tekton-robot tekton-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 5, 2026
@anithapriyanatarajan anithapriyanatarajan removed the lgtm Indicates that a PR is ready to be merged. label Mar 5, 2026

Development

Successfully merging this pull request may close these issues.

Detect *stale* tekton objects prior to an upgrade / reconciliation
